A comprehensive understanding of NVIDIA's "new GPU": 5 times faster than H100? 1200W power consumption? Liquid cooling? How does it compare to MI300X?
Blackwell delivers outstanding performance against its competitors, but to fully unleash its potential, switching to liquid cooling is all but mandatory
Author: Zhao Ying
Source: Hard AI
"Hopper is great, but we need more powerful GPUs," after two years, Huang Renxun announced the launch of the new generation Blackwell architecture GPU at the NVIDIA AI event GTC.
With the rise of generative AI, NVIDIA is attracting customers with more powerful chips, and the Blackwell architecture is highly anticipated for its performance leap.
According to media analysis on Monday, as the successor to the Hopper architecture, Blackwell delivers an impressive performance jump: the highest-spec Blackwell chip's floating-point throughput (FLOPS) is roughly 5 times that of its predecessor, with further gains in energy efficiency. Against AMD's MI300X GPU it remains highly competitive, consolidating NVIDIA's technical lead in both performance and efficiency.
The key to the performance gain lies in Blackwell's dual-die design. Each GPU actually integrates two compute dies, linked by the 10TB/s NV-HBI (High Bandwidth Interface) so that they function as a single accelerator.
In addition, the two compute dies are flanked by eight 8-high stacks of HBM3e memory, giving a total capacity of up to 192GB and bandwidth of up to 8TB/s. Unlike the H100 and H200, the B100 and B200 share the same memory capacity and bandwidth. The Blackwell series currently includes three models: the B100, the B200, and the Grace-Blackwell Superchip (GB200).
Still, extracting maximum performance is not straightforward and depends on several factors. Although NVIDIA claims the new chip can reach 20 petaflops, that figure is based on the newly introduced FP4 precision and was measured on liquid-cooled servers. To unleash Blackwell's full potential, switching to liquid cooling is all but mandatory. Compared against the previous-generation H100 at matched FP8 precision, the new chip is only about 2.5 times faster.
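As a back-of-envelope check of those headline numbers (a sketch assuming H100 sparse FP8 throughput of roughly 4 petaflops, a commonly cited spec not stated in this article), the 5x figure comes from comparing the new FP4 mode against H100's FP8, while matched-precision FP8 yields about 2.5x:

```python
# Back-of-envelope check of the headline numbers above.
# Assumption (not from the article): H100 sparse FP8 throughput of
# roughly 4 petaflops, a commonly cited spec.

B200_FP4_PFLOPS = 20.0                  # liquid-cooled top bin, per the article
B200_FP8_PFLOPS = B200_FP4_PFLOPS / 2   # FP8 runs at half the FP4 rate
H100_FP8_PFLOPS = 4.0                   # assumed H100 spec

print(f"FP4 headline vs H100 FP8: {B200_FP4_PFLOPS / H100_FP8_PFLOPS:.1f}x")  # ~5.0x
print(f"Matched FP8 precision:    {B200_FP8_PFLOPS / H100_FP8_PFLOPS:.1f}x")  # ~2.5x
```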
Powerful Performance of GB200 Superchip
NVIDIA's most powerful GPUs ship as part of the GB200. Like the Grace-Hopper chip before it, the Grace-Blackwell Superchip combines the existing 72-core Grace CPU with Blackwell GPUs using NVLink-C2C interconnect technology.
However, unlike a single H100 GPU, the GB200 carries two Blackwell accelerators, delivering 40 petaflops of compute alongside 384GB of HBM3e memory. The previous GH200 was rated at 1000W, comprising a 700W GPU and a 300W Arm CPU. It can therefore be roughly estimated that at full load the GB200, with two GPUs at 1200W each plus the same Arm CPU, draws around 2700W in total. It is no surprise, then, that NVIDIA has gone straight to liquid cooling.
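The power estimate works out as follows (a sketch reproducing the article's arithmetic; the per-component figures are extrapolations, not an official NVIDIA power breakdown):

```python
# Reproducing the article's rough power arithmetic for the GB200.
# Component figures are the article's extrapolation, not an official
# NVIDIA power breakdown.

BLACKWELL_GPU_W = 1200  # per GPU at full load, liquid-cooled
GRACE_CPU_W = 300       # same Arm CPU budget as in the GH200
NUM_GPUS = 2

total_w = NUM_GPUS * BLACKWELL_GPU_W + GRACE_CPU_W
print(f"Estimated GB200 full-load power: {total_w} W")  # 2700 W
```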
By removing bulky heat spreaders and fitting several cold plates, NVIDIA can pack two of these accelerators into a 1U rack system that delivers up to 80 petaflops of FP4 compute, or 40 petaflops at FP8.
Compared to the previous generation, this dual-GB200 system delivers more compute than NVIDIA's 8U, 10.2kW DGX H100 system - 40 petaflops versus 32 petaflops at FP8 - while shrinking the required space to one-eighth.
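The density claim follows directly from the article's own numbers (a minimal sketch, both figures at FP8):

```python
# Compute density from the article's figures, both measured at FP8.

systems = [
    ("dual-GB200 (1U)", 40, 1),   # petaflops, rack units
    ("DGX H100 (8U)",   32, 8),
]

for name, pflops, units in systems:
    print(f"{name}: {pflops / units:.0f} petaflops per rack unit")
# 40 PF/U vs 4 PF/U: ten times the density in one-eighth of the space.
```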
The new generation NVLink connection scheme significantly boosts performance
The GB200 forms the core of NVIDIA's NVL72 rack-scale AI system: the GB200 NVL72 uses NVLink switches to combine 36 GB200s into a single system. It is designed to support large-scale training and inference workloads, handling large language models with up to 27 trillion parameters.
According to NVIDIA, in training this system can achieve 720 petaflops at FP8 precision, while for inference workloads its compute reaches 1.44 exaflops at FP4. If that is not enough, eight NVL72 racks can be interconnected to form the "behemoth" DGX GB200 SuperPod.
Each rack holds 18 nodes, for a total of 36 Grace CPUs and 72 Blackwell accelerators. These nodes are then interconnected through a set of nine NVLink switches, allowing them to behave like a single GPU node with 13.5TB of HBM3e memory.
This is essentially the same technology NVIDIA used in earlier DGX systems to make eight GPUs behave like a single large one; the difference is that NVIDIA now uses dedicated NVLink switch appliances to scale to far more GPUs. The new-generation NVLink gives each GPU 1.8TB/s of bidirectional bandwidth and supports seamless high-speed communication among up to 576 GPUs.
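The rack-level totals follow from the per-chip figures above (a sketch assuming, as the GB200 description implies, two GB200 superchips per node, each pairing one Grace CPU with two Blackwell GPUs):

```python
# Rack-level totals implied by the figures above. Assumes each of the
# 18 nodes carries two GB200 superchips, each superchip pairing one
# Grace CPU with two Blackwell GPUs.

NODES_PER_RACK = 18
GB200_PER_NODE = 2
GPUS_PER_GB200 = 2
HBM3E_PER_GPU_GB = 192

gb200_count = NODES_PER_RACK * GB200_PER_NODE     # 36 superchips = 36 Grace CPUs
blackwell_gpus = gb200_count * GPUS_PER_GB200     # 72 GPUs
total_hbm_gb = blackwell_gpus * HBM3E_PER_GPU_GB  # 13,824 GB

print(f"{gb200_count} Grace CPUs, {blackwell_gpus} Blackwell GPUs")
print(f"Pooled HBM3e: {total_hbm_gb / 1024:.1f} TB")  # 13.5 TB
```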
Increased cooling requirements, liquid cooling becoming essential
Although NVIDIA's new-generation products do not mandate liquid cooling, anyone hoping to fully exploit the flagship chips will find it close to a necessity. The main differences between the B100, B200, and GB200 lie in power and performance: according to NVIDIA, these chips can draw anywhere from 700W to 1200W, depending on the specific model and cooling method.
Chip performance naturally varies with the power envelope. NVIDIA points out that air-cooled systems such as the HGX B100 can reach 14 petaflops per GPU while drawing roughly the same power as an H100. This means that any data center already capable of supporting NVIDIA's DGX H100 systems should have no trouble accommodating B100 nodes.
The B200, on the other hand, is more interesting. In air-cooled HGX or DGX configurations, each B200 GPU delivers 18 petaflops while drawing one kilowatt. According to NVIDIA, a DGX B200 chassis equipped with eight B200 GPUs consumes roughly 14.3kW in total, which translates to a requirement of about 60kW of rack power and cooling capacity.
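The roughly 60kW figure is consistent with packing several such chassis into one rack (a sketch; the four-chassis-per-rack packing is an illustrative assumption, not an NVIDIA specification):

```python
# Rack power sketch for air-cooled DGX B200, from the article's
# figures. The four-chassis-per-rack packing is an illustrative
# assumption, not an NVIDIA specification.

GPU_W = 1000             # per B200 GPU, air-cooled
GPUS_PER_CHASSIS = 8
CHASSIS_TOTAL_W = 14300  # NVIDIA's figure for one DGX B200 chassis

overhead_w = CHASSIS_TOTAL_W - GPUS_PER_CHASSIS * GPU_W
print(f"Non-GPU overhead per chassis: ~{overhead_w / 1000:.1f} kW")  # ~6.3 kW

rack_w = 4 * CHASSIS_TOTAL_W
print(f"Four chassis per rack: ~{rack_w / 1000:.1f} kW")  # ~57 kW, near the ~60 kW cited
```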
For a new data center specifically designed for AI clusters, this is not a problem; however, for existing facilities, the challenge may be greater.
In the field of AI data centers, turning to liquid cooling has become almost necessary to unleash the full potential of Blackwell. In a liquid-cooled configuration, the heat output of the chips at full load can reach 1200W, while achieving a performance of 20 petaflops.
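Putting the three cooling configurations side by side on the article's own FP4 figures suggests that liquid cooling buys peak throughput rather than efficiency (a quick sketch):

```python
# Performance per watt across the cooling configurations, using the
# article's FP4 throughput and power figures.

configs = [
    ("B100, air-cooled",    14, 700),   # petaflops FP4, watts
    ("B200, air-cooled",    18, 1000),
    ("B200, liquid-cooled", 20, 1200),
]

for name, pflops, watts in configs:
    print(f"{name}: {pflops * 1000 / watts:.1f} FP4 teraflops per watt")
# 20.0, 18.0, 16.7: the liquid-cooled bin has the highest absolute
# throughput but the lowest performance per watt.
```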
Compared to competitors, Blackwell still holds advantages
While NVIDIA currently dominates the AI infrastructure market, it is not the only player. Heavyweight competitors such as Intel and AMD are fielding Gaudi and Instinct accelerators, cloud service providers are pushing custom silicon, and AI startups like Cerebras and SambaNova are also staking out positions in the race.
Compared to AMD's MI300X GPU launched in December last year, Blackwell still maintains advantages:
The MI300X uses advanced packaging to stack eight CDNA 3 compute dies vertically on four I/O dies, providing high-speed die-to-die communication and 192GB of HBM3 memory.
In terms of performance, AMD claims the MI300X holds a 30% advantage over NVIDIA's H100 in FP8 floating-point throughput, and a nearly 2.5x lead in HPC-oriented double-precision workloads. Comparing the 750W MI300X against the 700W B100, however, NVIDIA's chip is 2.67 times faster in sparse performance.

In addition, although both chips now carry 192GB of high-bandwidth memory, the Blackwell part's memory is substantially faster at 8TB/s, versus 5.3TB/s for the MI300X. Memory bandwidth has proven to be a key indicator of AI performance, especially in inference. The NVIDIA H200, for example, is essentially a bandwidth-enhanced H100: despite identical FLOPS, NVIDIA claims the H200 is twice as fast as the H100 on models like Meta's Llama 2 70B.
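A simplified roofline estimate illustrates why bandwidth dominates inference: at small batch sizes, every generated token must stream the full weight set from HBM, so tokens per second is bounded by bandwidth divided by model size (a sketch assuming FP8 weights and ignoring KV-cache traffic, batching, and multi-GPU sharding; the H100 and H200 bandwidth figures are commonly cited specs, not from this article):

```python
# Simplified roofline: at small batch sizes, generating one token
# requires streaming every weight from HBM once, so tokens/s is
# bounded by bandwidth / model size. Ignores KV-cache traffic,
# batching, and multi-GPU sharding. Bandwidth figures for H100 and
# H200 are assumed public specs, not taken from the article.

MODEL_PARAMS = 70e9      # Llama 2 70B
BYTES_PER_PARAM = 1      # assume FP8 weights, ~70 GB total
model_bytes = MODEL_PARAMS * BYTES_PER_PARAM

for name, bw_tb_s in [("H100", 3.35), ("H200", 4.8), ("B200", 8.0)]:
    print(f"{name}: ~{bw_tb_s * 1e12 / model_bytes:.0f} tokens/s upper bound")
```

The 2x figure NVIDIA quotes for the H200 also reflects its larger memory allowing bigger batches; the bound above captures only the bandwidth term.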
While NVIDIA holds a clear lead at low precision, it may have sacrificed double-precision performance, an area where AMD has excelled in recent years, winning multiple high-profile supercomputer contracts.
Analysts predict that demand for new AI products in 2024 will far outstrip supply. In that environment, winning market share is not always about having the fastest chip; what matters is which chips can actually be launched and shipped. Exciting as Blackwell's performance is, buyers will have to wait some time to get their hands on it: volume production of the B200 and GB200 does not appear to ramp until early 2025.