Behind NVIDIA's "mysterious chip" – The era of inference begins with "four new trends in computing power"

NVIDIA will integrate LPU technology into a new inference chip, and OpenAI has committed to large purchases, marking the shift of the AI computing power battleground from training to inference. Shenwan Hongyuan Research points out that the era of inference is giving rise to four new trends: more CPU deployment scenarios, the rise of dedicated architectures such as the LPU, accelerated breakthroughs in domestic chips, and a shift in computing power demand from training to massive token consumption. As chips split into training and inference specializations and systems evolve toward a three-layer architecture, makers of cost-effective inference chips stand to benefit the most.

NVIDIA's integration of LPU (Language Processing Unit) technology and OpenAI's multi-vendor approach to inference chips are shifting the main battleground of AI computing power competition from training to inference. Shenwan Hongyuan Research believes that inference will be the core keyword of the computing power industry in 2026, with total token consumption and technological paradigms being fundamentally reshaped around this theme.

On February 28, according to The Wall Street Journal, NVIDIA plans to unveil a new inference chip that integrates Groq's "Language Processing Unit" (LPU) technology at next month's GTC developer conference. NVIDIA CEO Jensen Huang described it as a "new system the world has never seen before." OpenAI has agreed to become one of the largest customers for this processor and will purchase large-scale "dedicated inference capacity" from NVIDIA.

Meanwhile, last month OpenAI also struck a multi-billion-dollar computing partnership with startup Cerebras, which claims its inference chips are faster than NVIDIA's GPUs (Graphics Processing Units). Taken together, these developments indicate that AI giants are moving from an arms race in training computing power to a multi-vendor buildout of inference computing power.

The Shenwan Hongyuan report points out that in the era of the token economy, inference computing power is showing four major trends: first, more CPU-only (Central Processing Unit) deployment scenarios, as demand for low-cost inference accelerates the downward migration of computing power; second, the rise of dedicated architectures such as the LPU, which challenge the dominance of GPUs in inference; third, accelerating breakthroughs in domestic computing power chips, with a clear trend toward supply chain diversification; fourth, a shift in the demand structure for inference computing power from "one-off training" to "massive token consumption," with cost-effectiveness becoming a core competitive factor.

The report states that vendors able to supply cost-effective inference chips in sufficient volume will benefit the most, and that the combined breakthroughs of CPUs, LPUs, and domestic chips form the main threads of this round of reshaping the computing power landscape.

Inference Demand Explodes, Token Consumption Hits Record High

Shenwan Hongyuan Research believes that two structural forces are driving the continuous expansion of demand: first, the accelerating monetization of large models, with models like Claude beginning to penetrate the application side and releasing multiple industry plugins; second, the accelerating rollout of Agents, with products like OpenClaw and Qianwen Agent marking the entry of Agents into real work and production scenarios. Each model invocation and each Agent task requires substantial inference computing power.

Shenwan Hongyuan Research cites data showing that during the Spring Festival, the inference volume of leading domestic models increased significantly: Doubao's inference throughput reached 63.3 billion tokens on New Year's Eve, Yuanbao had 114 million monthly active users, and over 120 million people participated in Qianwen's "Spring Festival Free Order" event.

Data from OpenRouter, a global AI model API aggregation platform, further reveals the scale of this trend. From February 9 to 15, Chinese models surpassed American models for the first time, with a call volume of 41.2 trillion tokens compared to 29.4 trillion tokens; from February 16 to 22, the call volume of Chinese models surged further to 51.6 trillion tokens, a 127% increase over three weeks, with four of the top five models by global call volume being Chinese.
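The quoted OpenRouter figures can be cross-checked with simple arithmetic. The sketch below uses only the three volumes stated above; the baseline from three weeks earlier is not given in the article, so the value computed here is an inference that assumes "a 127% increase over three weeks" means the latest week is 2.27 times the volume three weeks before.

```python
# Weekly token volumes quoted from OpenRouter, in trillions of tokens.
chinese_feb_09_15 = 41.2
american_feb_09_15 = 29.4
chinese_feb_16_22 = 51.6


def pct_change(old: float, new: float) -> float:
    """Percentage change going from `old` to `new`."""
    return (new - old) / old * 100.0


# How far Chinese models led American models in the week of Feb 9-15.
lead_pct = pct_change(american_feb_09_15, chinese_feb_09_15)

# Week-over-week growth of Chinese models between the two quoted weeks.
week_over_week_pct = pct_change(chinese_feb_09_15, chinese_feb_16_22)

# ASSUMPTION: the 127% figure compares Feb 16-22 with the level three weeks
# earlier. The article does not state that baseline, so it is derived here.
implied_baseline = chinese_feb_16_22 / (1 + 127 / 100)

print(f"Lead over American models: {lead_pct:.1f}%")
print(f"Week-over-week growth:     {week_over_week_pct:.1f}%")
print(f"Implied earlier baseline:  {implied_baseline:.1f} trillion tokens")
```

Under that reading, Chinese models led by roughly 40% in the first week, grew about 25% week over week, and started from roughly 22.7 trillion tokens three weeks earlier.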

LPU Becomes a New Star, Training and Inference Chips Move Towards Differentiation

NVIDIA spent $20 billion to license Groq's core technology and brought over its executive team, including founder Jonathan Ross, in a so-called "core hiring" deal. Shenwan Hongyuan Research believes that this transaction marks formal recognition by a top player of the importance of pure inference chips.

The architectural differences between the LPU and the traditional GPU are the fundamental reason for the LPU's efficiency advantage in inference scenarios. AI inference is divided into two stages, prefill and decode, and the decode stage of large models is particularly slow because tokens are generated one at a time. The LPU has been specifically optimized for the two major inference bottlenecks: latency and memory bandwidth. According to previous reports from Wall Street News, NVIDIA's upcoming products may involve the next-generation Feynman architecture, possibly adopting more extensive SRAM integration, or even deeply integrating the LPU through 3D stacking technology.
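Why memory bandwidth matters so much for decoding can be illustrated with a back-of-envelope bound. The sketch below assumes a simplified model in which every generated token requires streaming all model weights from memory once, and it ignores KV cache traffic, batching, and the number of chips needed to hold the model. The parameter count and the two bandwidth figures are hypothetical placeholders, not specifications from the article or from any vendor.

```python
def decode_tokens_per_second(
    param_count: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-stream decode speed when each new token
    requires reading every weight from memory exactly once."""
    bytes_read_per_token = param_count * bytes_per_param
    return bandwidth_bytes_per_s / bytes_read_per_token


# HYPOTHETICAL values chosen only for illustration.
PARAMS = 70e9                # a 70-billion-parameter model
BYTES_PER_PARAM = 1.0        # 8-bit weights
OFF_CHIP_BANDWIDTH = 3e12    # 3 TB/s, an off-chip (HBM-class) memory system
ON_CHIP_BANDWIDTH = 80e12    # 80 TB/s, an on-chip (SRAM-class) memory system

off_chip_rate = decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, OFF_CHIP_BANDWIDTH)
on_chip_rate = decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, ON_CHIP_BANDWIDTH)

print(f"Off-chip memory bound: {off_chip_rate:.0f} tokens/s per stream")
print(f"On-chip memory bound:  {on_chip_rate:.0f} tokens/s per stream")
print(f"Ratio:                 {on_chip_rate / off_chip_rate:.1f}x")
```

In this simplified model the decode ceiling scales linearly with memory bandwidth, which is the intuition behind SRAM-heavy inference designs. Prefill, by contrast, processes the whole prompt in parallel and tends to be limited by compute rather than bandwidth.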

Based on this, Shenwan Hongyuan Research predicts that future AI chips will form a clear technological division of labor: training will continue to use the GPU-HBM combination, while inference will evolve toward a combination of ASIC+LPU-SRAM+SSD. As computing power demand shifts from training to inference, manufacturers focusing on inference chips stand to gain.

Comprehensive Innovation in Inference Systems, CPU and Network Demand Rise Simultaneously

The shift from single-chip to system-level innovation is another important dimension of this round of inference computing power upgrades. Shenwan Hongyuan Research points out that as application scenarios move from chatbots to Agents, requirements for latency, throughput, and reasoning depth are rising at the same time, driving system architecture toward a three-layer design.

The first layer is the fast response layer, which provides extremely low-latency feedback through pure inference chips equipped with SRAM. The second layer is the slow thinking layer, which uses ultra-high-throughput computing clusters to handle complex logical deductions; demand for multi-core, multi-threaded CPUs rises significantly at this layer. The third layer is the memory layer, corresponding to NVIDIA's Context Memory System, which manages the long-term memory and KV Cache of Agents on SSD storage managed by the BlueField-4 DPU.
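The three layers can be pictured as a simple dispatch rule. The sketch below is purely illustrative: the `Request` fields, the `route` function, and the 200 ms threshold are hypothetical names and values invented for this example, and nothing here reflects how NVIDIA or the report actually schedules work.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST_RESPONSE = "fast response layer (SRAM-equipped inference chips)"
    SLOW_THINKING = "slow thinking layer (high-throughput compute cluster)"
    MEMORY = "memory layer (DPU-managed SSD storage)"


@dataclass
class Request:
    # HYPOTHETICAL fields, used only to make the three layers concrete.
    kind: str                 # "generate" or "context"
    latency_budget_ms: float  # how long the caller is willing to wait


def route(request: Request, fast_threshold_ms: float = 200.0) -> Tier:
    """Pick a layer for a request under the illustrative rule below."""
    if request.kind == "context":
        # Long-term agent memory and KV cache reads/writes go to storage.
        return Tier.MEMORY
    if request.latency_budget_ms <= fast_threshold_ms:
        # Tight latency budgets go to the low-latency chips.
        return Tier.FAST_RESPONSE
    # Everything else is treated as deeper, throughput-oriented reasoning.
    return Tier.SLOW_THINKING


for req in (
    Request("generate", 50.0),
    Request("generate", 5000.0),
    Request("context", 100.0),
):
    print(f"{req} -> {route(req).value}")
```

A real system would route on far richer signals than a single latency budget, but the split into latency-bound, throughput-bound, and storage-bound work is the point the layering makes.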

NVIDIA is also adjusting its strategy at the hardware level. The previous standard practice of bundling the Vera CPU with the Rubin GPU has proven too costly for certain AI agent workloads. This month, NVIDIA announced an expansion of its cooperation with Meta Platforms, completing its first large-scale CPU-only deployment to support Meta's advertising-targeting AI agents and marking the company's move beyond a GPU-only sales model.

Domestic Computing Power Accelerates Breakthrough

Shenwan Hongyuan Research believes that the technological upgrades of domestic inference chips deserve close attention and that the market has not fully recognized them, leaving an expectation gap.

On the technical level, the new generation of domestic inference chips has achieved several fundamental improvements. It adds support for low-precision data formats such as FP8/MXFP8/MXFP4, with computing power reaching 1 PFLOPS at 8-bit precision and 2 PFLOPS at 4-bit precision. It significantly enhances vector computing power, adopting a new homogeneous design that supports both SIMD and SIMT programming models. Its interconnect bandwidth is 2.5 times that of the previous generation, reaching 2 TB/s.

Notably, prefill-decode (PD) separation has been achieved at the chip level: using two self-developed HBM variants with different specifications, the chip comes in a PR version aimed at Prefill and recommendation scenarios and a DT version aimed at Decode and training scenarios. The PR version uses low-cost HBM, which can significantly reduce investment costs for the Prefill stage of inference, and is expected to launch in Q1 2026.

On the supply chain level, the progress of domestic packaging and testing manufacturers provides supporting evidence. According to a leading packaging and testing company's response to its first-round inquiry, its 2.5D packaging revenue comes mainly from packaging services for high-performance computing chips and grew rapidly from 50 million yuan in 2022 to 1.82 billion yuan in 2024. This indirectly confirms that the supply capacity of domestic computing power chips continues to improve and that supply chain localization is accelerating.
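To put the packaging revenue figures in perspective, the growth multiple and the implied compound annual growth rate can be computed directly from the two endpoints quoted above. The 2023 figure is not given, so this sketch assumes smooth compounding between 2022 and 2024, which may not match the actual year-by-year path.

```python
# 2.5D packaging revenue quoted in the article, in yuan.
revenue_2022 = 50e6     # 50 million yuan
revenue_2024 = 1.82e9   # 1.82 billion yuan
years = 2024 - 2022

# Overall growth multiple across the two-year span.
multiple = revenue_2024 / revenue_2022

# Compound annual growth rate, ASSUMING smooth compounding between endpoints.
cagr = multiple ** (1 / years) - 1

print(f"Growth multiple over {years} years: {multiple:.1f}x")
print(f"Implied CAGR: {cagr * 100:.0f}%")
```

The two endpoints imply revenue growing roughly 36-fold in two years, or about sixfold per year on average.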