
Google's next-generation TPU is about to be released: a critical strike against NVIDIA in the era of AI inference
Google is set to launch its next generation of TPU at the Google Cloud Next conference, marking the arrival of the AI inference era. As the competition for AI computing power shifts from training to inference, Google plans to challenge NVIDIA's dominance in the AI chip market with its self-developed TPU. The launch aims to meet the demand for cloud AI inference computing power, improve cost-effectiveness, and promote the widespread adoption of AI applications.
As the AI computing power battlefield shifts from training to inference, Google (GOOGL.US) is preparing to deliver a key blow. According to Zhitong Finance APP, the company plans to announce its next-generation custom AI chip, the Tensor Processing Unit (TPU), at the Google Cloud Next conference to be held this week in Las Vegas. Amin Vahdat, who oversees Google's AI computing infrastructure and chip development, declined to comment on chip plans that could accelerate AI output speed, but indicated that more information is likely to be shared in the "relatively near future."
The backdrop to this signal is a structural shift in the global AI computing power competition, from a focus on model training to the dominance of large-scale inference. As the adoption of AI applications and AI Agents surges, the metrics for measuring computing power are shifting from "peak performance" to cost per token, latency, and energy efficiency. This is precisely where the AI ASIC route represented by the TPU holds its greatest advantage.
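The metric shift described above can be made concrete with a back-of-envelope calculation. The sketch below compares a hypothetical general-purpose GPU against a hypothetical inference ASIC on cost per million tokens served; every figure (hourly cost, power draw, throughput) is an illustrative placeholder, not a measurement of any real TPU or GPU.

```python
# Illustrative sketch of the inference-era metric: instead of peak FLOPS,
# compare accelerators by total cost per million tokens generated.
# All numbers below are hypothetical placeholders.

def cost_per_million_tokens(hourly_cost_usd: float,
                            power_kw: float,
                            electricity_usd_per_kwh: float,
                            tokens_per_second: float) -> float:
    """Combined (rental + energy) cost of generating one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    energy_cost_per_hour = power_kw * electricity_usd_per_kwh
    total_hourly_cost = hourly_cost_usd + energy_cost_per_hour
    return total_hourly_cost / tokens_per_hour * 1_000_000

# A hypothetical general-purpose GPU vs. a hypothetical inference ASIC:
gpu = cost_per_million_tokens(hourly_cost_usd=4.0, power_kw=0.7,
                              electricity_usd_per_kwh=0.10,
                              tokens_per_second=2500)
asic = cost_per_million_tokens(hourly_cost_usd=2.5, power_kw=0.4,
                               electricity_usd_per_kwh=0.10,
                               tokens_per_second=3000)
print(f"GPU:  ${gpu:.3f} per 1M tokens")
print(f"ASIC: ${asic:.3f} per 1M tokens")
```

Under these assumed numbers the ASIC wins on all three axes at once: lower rental cost, lower power, higher throughput, which is exactly why "cost per token" favors purpose-built silicon.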
TPU Accelerates Out of the Circle: Google Launches a Substantial Challenge to NVIDIA's Computing Power Dominance
In light of this trend, Google is attempting to launch a direct challenge to NVIDIA, which currently holds about 80% to 90% of the AI chip market, with its self-developed TPU system.
In just a few months, the TPU AI chips that Google develops exclusively and deploys at massive scale in its own data centers have become one of the hottest commodities in the global tech industry. Leading AI developers, including some of Google's largest competitors, are stockpiling these chips. With the full arrival of the AI inference era, demand for cloud AI inference computing power has surged, and the trend of "AI micro-training," which focuses on embedding large AI models into business operations, has turned Google's TPU computing system into a serious challenge to NVIDIA's roughly 90% market share.
Now, this tech giant under Alphabet Inc. hopes to continue to ramp up its existing growth momentum with the upcoming launch of a new AI accelerator chip specifically designed for the AI inference wave.
The global surge in generative AI and AI Agent deployment has accelerated AI chip development among cloud computing and chip giants, who are racing to design the fastest, most energy-efficient computing infrastructure for advanced large-scale AI data centers. Broadcom and its largest competitor, Marvell, focus on leveraging their advantages in high-speed interconnect and chip IP, partnering with cloud giants such as Amazon, Google, and Microsoft to build AI ASIC computing clusters tailored to those companies' data centers. This ASIC business has grown into a major revenue line for both companies and is a key reason Marvell's and Broadcom's stock prices have surged this year; the TPU computing cluster Broadcom builds in partnership with Google is a typical example of the AI ASIC route.
There is no doubt that significant economic and power constraints are pushing Microsoft, Amazon, Google, and Facebook's parent company Meta toward the AI ASIC route inside their cloud computing systems by developing their own AI chips, with the core goal of making AI computing clusters more cost-effective and energy-efficient. The construction costs of ultra-large-scale AI data centers on the order of "Stargate" are high, leading technology giants to demand ever more economical AI computing systems. Under power constraints, these giants strive to optimize cost per token and output per watt, marking the arrival of a boom era for AI ASIC technology.
Additionally, advanced AI GPU clusters such as those built on NVIDIA's Blackwell architecture remain in chronic short supply and command high prices, constrained by supply chain bottlenecks and delivery schedules. Self-developed AI ASICs can provide a "second curve" of capacity, giving cloud providers more leverage in procurement negotiations, product pricing, and gross margins. Furthermore, cloud giants like Google and Microsoft can integrate the design of chips, interconnects, systems, compilers/runtimes, scheduling, and observability/reliability, improving the utilization of computing infrastructure and reducing total cost of ownership (TCO).
The AI training side, nearly monopolized by NVIDIA's GPUs, demands more powerful general-purpose computing clusters and rapid iteration across the entire computing system. In contrast, the AI inference side, now that cutting-edge AI is being deployed at scale, places greater emphasis on cost per token, latency, and energy efficiency. Google, for instance, has explicitly positioned its Ironwood TPU as a generation "born for the AI inference era," emphasizing the performance, energy efficiency, cost-effectiveness, and scalability of its computing clusters. Amazon's latest moves, however, have demonstrated that AI ASICs may also have strong potential for training large models.
The AI ASIC computing system will undoubtedly continue to weaken NVIDIA's monopoly premium and some market share in the medium to long term, rather than linearly replacing the GPU system. The fundamental underlying reason is that the core competition in the inference era is no longer just "peak computing power," but rather the cost per token, power consumption, memory bandwidth utilization, interconnect efficiency, and total cost of ownership after hardware-software collaboration. In terms of these metrics, ASICs customized for specific workloads with tailored data flows, compilers, and interconnects are inherently more cost-effective than general-purpose GPUs. In the future, it is more likely that cutting-edge training and general cloud computing power will continue to be dominated by GPUs, while ultra-large-scale internal inference, agent workflows, and fixed high-frequency loads will accelerate the shift towards ASICs, ushering data centers into a true heterogeneous computing era.
Overnight Fame is Actually a Decade in the Making: How TPU Transformed from an Internal Tool at Google to a Hard Currency in the Global Tech Industry
Google's long-term chip efforts gained unprecedented attention last October when Anthropic PBC, the developer of the Claude family of large AI models and a magnet for investor interest, announced an expanded computing power supply agreement securing the use of up to 1 million Google TPUs. The following month, Google launched its more advanced Gemini 3 model, announcing that it had been trained and run in part on the TPU computing platform, to widespread acclaim.
Since then, demand for Google TPU chips from large enterprises has only increased. Meta Platforms Inc., the parent company of Facebook, signed a multi-year, multi-billion-dollar AI computing infrastructure supply agreement to use TPUs through Google Cloud. Santosh Janardhan, Meta's head of infrastructure, stated that the company recently obtained a large supply of cloud TPU computing power for the first time and is testing the chips to assess which tasks they are best suited for. "There does seem to be a potential exclusive advantage in inference," he said, while also noting that "no new platform comes without obstacles and a learning curve."
Anthropic has also signed a long-term agreement with Broadcom, Google's TPU partner, involving self-developed chips that will enable it to utilize approximately 3.5 gigawatts of computing power starting in 2027. Citadel Securities plans to showcase at the Google conference how TPU allows the company to train large AI models faster than when using GPUs previously. Abu Dhabi technology group G42 has also had "multiple discussions" with Google about using Google TPU, according to Talal Al Kaissi, interim CEO of the group's cloud computing division, Core42. "I am very optimistic," Al Kaissi said when discussing these discussions.
Google is already taking new steps to meet customers' practical cloud AI computing needs. According to a person familiar with the matter, the company is testing a setup that would allow companies like Anthropic to run TPUs inside their own physical AI data centers, rather than relying solely on Google's cloud infrastructure. Vahdat stated that Google has also opened the TPU to external tools such as the PyTorch framework and third-party scheduling software, rather than requiring customers to use Google's own products.
These changes are helping to alter the external perception of these chips. They were initially born out of Google's own AI computing bottlenecks and have long been viewed primarily as a means to meet the company's own needs for internal use.
After Google Chief Scientist Jeff Dean began building an early AI software system to power language translation and speech recognition services, he realized that even Google could not afford to provide such services on existing chips and hardware infrastructure. This is also why Google, despite having its own TPU computing system, continues to invest in NVIDIA's broadly general-purpose AI GPU computing. Meanwhile, the performance gains of the central processing units Google relied on for AI were also slowing down.
The company decided to create an AI accelerator focused on a narrower set of tasks, which happened to be the ones generating the largest bills in AI. Vahdat stated that the key idea behind the TPU is that it "addresses a few specific problems, but the amount of additional computing or general computing required for those problems is extremely large." Vahdat, a former computer science professor, played a key role in Google's adoption of the optical switches (OCS, optical circuit switch systems) that help connect TPUs into supercomputers. "The conventional wisdom at the time was that you didn't need to build dedicated hardware."

Over the years, Google's TPU has evolved in sync with its AI research. A groundbreaking 2017 Google research paper, which introduced the Transformer architecture behind today's large language models, prompted the TPU team to focus on designing chips for training larger AI systems. Later, Google DeepMind and the chip team noticed that when TPUs were used for reinforcement learning, a popular method for improving AI systems' performance on specific tasks, the chips often sat idle for long stretches. The TPU team then adjusted the network connections between the chips to accelerate data flow and avoid idleness.
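The idle-time problem described above can be sketched in miniature: in a reinforcement-learning loop, accelerators wait while experience data moves between steps, so shrinking transfer time (for example, via a faster interconnect) directly raises utilization. The timings below are illustrative placeholders, not Google's figures.

```python
# Hypothetical sketch of accelerator idle time in a reinforcement-learning
# loop: per step, the chip computes for compute_s seconds, then waits
# transfer_s seconds for data movement. Numbers are illustrative only.

def accelerator_utilization(compute_s: float, transfer_s: float) -> float:
    """Fraction of wall-clock time the chip spends computing per step."""
    return compute_s / (compute_s + transfer_s)

slow_interconnect = accelerator_utilization(compute_s=10.0, transfer_s=10.0)
fast_interconnect = accelerator_utilization(compute_s=10.0, transfer_s=2.0)
print(f"slow interconnect: {slow_interconnect:.0%} utilized")  # 50%
print(f"fast interconnect: {fast_interconnect:.0%} utilized")  # 83%
```

The same logic explains why interconnect bandwidth, not just raw chip speed, dominates cluster economics: halving transfer time buys utilization that no amount of extra peak FLOPS can.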
This dynamic adjustment continues today, as Google weighs how many chips to connect in a single pod or whether hardware can reduce precision to save costs. "A lot of these things are guided by experiments with large AI models," Hassabis said. Looking ahead, he is very hopeful that the TPU research team will consider building an accelerator suitable for edge computing scenarios—placing chips closer to users rather than accessing them through the cloud to further reduce latency.
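The precision trade-off Google weighs can likewise be illustrated with arithmetic: halving the bytes per numeric value roughly halves the memory (and bandwidth) needed to hold a model's weights. The parameter count and formats below are illustrative assumptions, not details of any Google model.

```python
# Back-of-envelope sketch of the precision-reduction trade-off: fewer bytes
# per weight means a smaller, cheaper memory footprint. Figures illustrative.

BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate footprint of the weights alone, in gigabytes."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9

params = 70_000_000_000  # a hypothetical 70B-parameter model
for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: {weight_memory_gb(params, dtype):.0f} GB")
```

The catch, and the reason this remains a judgment call guided by experiments, is that lower precision can degrade model quality, so the savings are only real if accuracy holds up.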
In this process, Google has also built an exclusive internal AI verification system to quickly identify manufacturing defects, as these defects can have a disproportionately large impact on application software. When the entire computing system collaborates deeply with AI acceleration chips that handle massive mathematical computations, even a minor fault can quickly spread and lead to the model "completely self-destructing," said Paul Barham, a distinguished scientist at Google and co-lead of the Gemini infrastructure team. He noted that Google experienced such an issue about two years ago, taking weeks to clarify what had happened, describing it as a "bug from hell."
"We now have to complete this work within 10 seconds for hundreds of thousands of accelerator chips," he said.
The Ultimate Challenge Amid an Unprecedented AI Inference Boom: Supply, Technology Roadmap, and the Risk of "Technology Islands"
Despite its extensive experience developing large AI models, Google faces the same challenge as fabless chip giants like NVIDIA, AMD, and Broadcom: a chip typically takes about three years from inception to completion, while AI models evolve much faster. This makes it difficult to predict what customers will want years down the line.
"If someone claims they know what Gemini 10 will look like, I would just say, 'Please share some of what you just smoked with me,'" Ranganathan said.
Barham also worries that the tight feedback loop between AI model creators and hardware designers risks crowding out new ideas. There exists, he said, "a cycle that can trap you in what the current software and hardware already do well."
Gradient Canopy, a building on Google's Mountain View campus, is where Google's AI experts and chip designers often meet to share ideas. To strike a balance, the TPU development team sometimes aims to make chips "good enough" for a variety of uses, even if they are not perfect for any single one. Vahdat noted that another option is to plan two different designs in parallel. Both may not ship, but if their respective use cases prove attractive enough, both might.
As Google's chips grow in popularity, the company faces supply constraints similar to NVIDIA's. An executive at one startup, speaking anonymously in order to discuss internal matters, said the company's use of TPUs has been limited by supply availability, and complained that Google has effectively allocated all immediately available TPU chips to Anthropic.
“To a large extent, we are indeed prioritizing existing supply for the more elite teams, because clearly these teams may be the ones best able to leverage what TPUs excel at,” Hassabis said, referring to top AI companies. Going forward, Google will also need to decide how to allocate TPUs between its own increasingly competitive AI model and infrastructure business and its expanding customer list.
“Manufacturing TPUs exclusively for Google does have some advantages, but there are also substantial downsides,” Vahdat said. “Ultimately, you will end up on what we call a ‘technology island.’ It may be a beautiful island, but its population will be limited, and diversity will also be limited. In the end, it is likely to become less good.”
