TrendForce: NVIDIA Blackwell's high energy consumption drives demand for heat dissipation, with an estimated liquid cooling penetration rate reaching 10% by the end of 2024

Zhitong
2024.07.30 05:51

NVIDIA plans to launch its next-generation Blackwell platform by the end of 2024, and large cloud service providers will begin building AI Server data centers around it, which is expected to drive the penetration rate of liquid cooling solutions to 10%. Blackwell will replace the existing Hopper platform as the main solution for NVIDIA's high-end GPUs, accounting for nearly 83% of its overall high-end products. In AI Server models that pursue high performance, the power consumption of a single GPU can exceed 1,000W, which will drive growth across the AI Server liquid cooling supply chain. According to TrendForce, the thermal design power of server chips continues to rise, and traditional air cooling solutions can no longer meet the demand.

According to information obtained by Zhitong Finance APP, TrendForce stated in a July 30 post that as demand for high-speed computing grows, more effective AI Server cooling solutions are also drawing attention. According to TrendForce's latest AI Server report, with NVIDIA (NVDA.US) set to launch its new-generation Blackwell platform by the end of 2024, large CSPs (cloud service providers) will also begin building AI Server data centers based on the Blackwell platform, which is estimated to drive the penetration rate of liquid cooling solutions to 10%.

Parallel Air Cooling and Liquid Cooling Solutions Meet Higher Cooling Demands

According to TrendForce's research, the NVIDIA Blackwell platform is expected to enter official mass production in 2025, replacing the existing Hopper platform as the main solution for NVIDIA's high-end GPUs and accounting for nearly 83% of its overall high-end products. In AI Server models such as the B200 and GB200 that pursue high performance, the power consumption of a single GPU can exceed 1,000W. HGX models with 8 GPUs per unit, and NVL models with 36 or 72 GPUs per cabinet, will generate considerable heat, driving growth in the AI Server liquid cooling supply chain.

TrendForce mentioned that the Thermal Design Power (TDP) of server chips continues to increase, with the TDP of the B200 chip reaching 1,000W; traditional air cooling solutions are insufficient to meet this demand. The TDP of the GB200 NVL36 and NVL72 cabinets will reach approximately 70kW and nearly 140kW, respectively, requiring liquid cooling solutions to effectively address the heat dissipation.
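The cabinet figures above can be roughly reconciled with the per-GPU TDP in a back-of-envelope estimate. Note the `non_gpu_factor` below is a hypothetical multiplier of my own, standing in for everything the report does not itemize (Grace CPUs, switch trays, power conversion losses); it is chosen simply so the estimate lands near the quoted ~70kW and ~140kW figures, not taken from the source.

```python
# Back-of-envelope cabinet heat load from the figures cited in the text.
GPU_TDP_W = 1000  # B200-class GPU TDP per TrendForce

def cabinet_load_kw(num_gpus, non_gpu_factor=1.94):
    """Estimate total cabinet heat load in kW.

    non_gpu_factor is an assumed fudge factor (CPUs, networking,
    power losses) tuned to match the report's ~70 kW (NVL36)
    and ~140 kW (NVL72) cabinet figures.
    """
    return num_gpus * GPU_TDP_W * non_gpu_factor / 1000

print(f"NVL36 ~ {cabinet_load_kw(36):.0f} kW")
print(f"NVL72 ~ {cabinet_load_kw(72):.0f} kW")
```

The point of the sketch is that GPUs alone account for only about half the cabinet load, which is why rack-level liquid cooling, rather than per-chip air cooling, becomes the binding constraint.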

According to TrendForce, the initial architecture of the GB200 NVL36 cabinet will mainly use parallel air and liquid cooling solutions, while the NVL72, due to higher cooling capacity requirements, will prioritize the use of liquid cooling solutions.

Observing the current liquid cooling supply chain of the GB200 cabinet system, it mainly consists of five major components: Cold Plate, Coolant Distribution Unit (CDU), Manifold, Quick Disconnect (QD), and Rear Door Heat Exchanger (RDHx).

TrendForce pointed out that the CDU is the key system among them, responsible for regulating coolant flow throughout the system to keep the cabinet temperature within the preset TDP range. TrendForce observes that, for NVIDIA's AI solutions, Vertiv is currently the main CDU supplier, with Qinhong, Shuanghong, Delta Electronics, CoolIT, and others undergoing testing and verification.

By 2025, GB200 shipments are estimated to reach 60,000 cabinets, driving the Blackwell platform to become the market mainstream and account for over 80% of NVIDIA's high-end GPUs.

According to TrendForce's observation, in 2025 NVIDIA will target CSPs and enterprise customers with diverse AI Server configurations such as HGX, GB200 Rack, and MGX, with an estimated shipment ratio of approximately 5:4:1. The HGX platform can seamlessly succeed existing Hopper platform designs, enabling CSPs or large enterprise customers to adopt it quickly. The GB200 full-rack AI Server solution will target super-large CSPs; TrendForce anticipates NVIDIA will introduce the NVL36 configuration at the end of 2024 to enter the market quickly. The NVL72, given its more complex overall design and cooling system, is expected to launch in 2025.

TrendForce mentioned that with NVIDIA's aggressive expansion of its CSP customer base, total GB200 shipments (in NVL36-equivalent terms) could reach 60,000 cabinets in 2025, with Blackwell GPU usage in GB200 expected to reach 2.1-2.2 million units.

However, several variables remain in end customers' adoption of the GB200 Rack. TrendForce pointed out that the NVL72 requires a more sophisticated liquid cooling solution, which raises the difficulty of deployment. Liquid-cooled cabinet designs are better suited to new data centers, but these involve complex procedures such as land acquisition and building planning. In addition, CSPs may be reluctant to be locked into a single supplier's specifications and may instead choose models with x86 CPU architectures such as HGX or MGX, or expand their self-developed ASIC (Application-Specific Integrated Circuit) AI Server infrastructure, to reduce costs or address specific AI application scenarios.