How do NVIDIA's AI chips specially designed for China fare against the H100?
On paper, the H100 is more than 6 times faster than the H20. In LLM inference, however, the H20 is more than 20% faster than the H100.
According to recent media reports, NVIDIA is about to launch at least three new AI chips, the H20 SXM, the L20 PCIe, and the L2 PCIe, to replace the H100, which is restricted from export by the United States. All three chips are based on the Hopper GPU architecture, with a maximum theoretical performance of 296 TFLOPS (trillion floating-point operations per second, i.e. peak throughput).
It is almost certain that these three AI chips are "cut-down" or "watered-down" versions of the H100.
Theoretically, the H100 is 6.68 times faster than the H20. According to a recent blog post by analyst Dylan Patel, even if the H20's actual utilization rate reaches 90%, its performance in real-world multi-GPU interconnect environments will still only come close to 50% of the H100's.
Some media also claim that the H20's overall computing power is only equivalent to 20% of the H100's, and that the addition of HBM memory and the NVLink interconnect module has significantly raised its cost per unit of compute.
However, the H20 also has obvious advantages: it is more than 20% faster than the H100 in large language model (LLM) inference. The reason is that in some respects the H20 is similar to the H200, the next-generation super AI chip due to be released next year.
NVIDIA has already produced samples of these three chips. The H20 and L20 are expected to launch in December this year, and the L2 in January next year. Product sampling will begin one month before each release.
H20 vs. H100
Let's start with the H100. It has 80GB of HBM3 memory, a memory bandwidth of 3.4 TB/s, and a theoretical performance of 1,979 TFLOPS. Its performance density (TFLOPS per die area) is as high as 19.4, making it the most powerful GPU in NVIDIA's current product line.
The H20 has 96GB of HBM3 memory and a memory bandwidth of up to 4.0 TB/s, both higher than the H100. However, its compute is only 296 TFLOPS and its performance density is 2.9, far below the H100.
Theoretically, the H100 is 6.68 times faster than the H20. Note, however, that this comparison is based on FP16 Tensor Core FLOPS with sparsity enabled (which greatly reduces the amount of computation and thus inflates the headline speed), so it does not fully reflect real-world computing capability. In addition, the H20 has a thermal design power of 400W, lower than the H100's 700W, and can be configured with 8 GPUs in an HGX system (NVIDIA's GPU server solution). It also retains the 900 GB/s NVLink high-speed interconnect and supports 7 MIG (Multi-Instance GPU) instances.
H100 SXM FP16 Tensor Core (sparsity) TFLOPS: 1,979
H20 SXM FP16 Tensor Core (sparsity) TFLOPS: 296
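As a quick sanity check on the 6.68x figure, a minimal sketch using only the peak numbers quoted above:

```python
# Peak FP16 Tensor Core throughput with sparsity, in TFLOPS (figures quoted above)
h100_tflops = 1979
h20_tflops = 296

ratio = h100_tflops / h20_tflops
print(f"H100 / H20 peak ratio: {ratio:.2f}x")  # ~6.69x, in line with the ~6.68x cited above
```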
According to Patel's LLM performance comparison model, the H20's peak tokens per second at moderate batch sizes is 20% higher than the H100's, and its token-to-token latency at low batch sizes is 25% lower. This is because the number of chips required for inference drops from 2 to 1: with 8-bit quantization, the LLaMA 70B model can run effectively on a single H20 instead of requiring two H100s.
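The memory arithmetic behind that claim is straightforward. A rough sketch that counts only the model weights and ignores KV-cache and activation overhead, which in practice eats into the remaining headroom:

```python
# Rough memory footprint of LLaMA 70B weights at 8-bit quantization
params_billion = 70
bytes_per_param = 1                              # int8/fp8: one byte per parameter
weights_gb = params_billion * bytes_per_param    # ~70 GB of weights

h100_mem_gb = 80
h20_mem_gb = 96

print(weights_gb <= h20_mem_gb)    # True: ~70 GB of weights fits in the H20's 96 GB with room to spare
print(weights_gb <= h100_mem_gb)   # True for the weights alone, but the H100's 80 GB leaves
                                   # little room for KV cache at useful batch sizes
```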
It is worth mentioning that although the H20's peak compute is only 296 TFLOPS, far less than the H100's 1,979 TFLOPS, the picture changes once actual utilization (MFU) is taken into account: the H100's MFU is currently only 38.1%, while at 90% MFU the H20 can actually deliver roughly 270 TFLOPS. As a result, the H20's performance in real multi-GPU interconnect environments comes close to 50% of the H100's.
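As a quick check on the ~270 TFLOPS figure, a naive calculation using the 90% utilization assumption quoted above:

```python
# Effective throughput = peak TFLOPS x assumed utilization (MFU)
h20_peak_tflops = 296
assumed_mfu = 0.90   # utilization assumption quoted above

effective = h20_peak_tflops * assumed_mfu
print(f"H20 effective throughput: {effective:.0f} TFLOPS")  # ~266, i.e. roughly the ~270 cited
```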
From the perspective of traditional computing, the H20 is a downgrade compared to the H100. In LLM inference, however, the H20 is actually more than 20% faster than the H100, because in some respects it is similar to the H200 due to be released next year. Note that the H200 is the successor to the H100, a super chip for complex AI and HPC workloads.
L20 and L2 have more streamlined configurations
Meanwhile, the L20 is equipped with 48GB of memory and delivers 239 TFLOPS of compute, while the L2 has 24GB of memory and delivers 193 TFLOPS.
L20 is based on L40, and L2 is based on L4, but these two chips are not commonly used in LLM inference and training.
Both the L20 and L2 use the PCIe form factor, following the PCIe specifications suited to workstations and servers, and their configurations are more streamlined than higher-spec models such as the Hopper H800 and A800.
However, NVIDIA's software stack for AI and high-performance computing is so valuable to certain customers that they are unwilling to give up the Hopper architecture even with downgraded specifications.

L40 FP16 Tensor Core (sparsity) TFLOPS: 362
L20 FP16 Tensor Core (sparsity) TFLOPS: 239
L4 FP16 Tensor Core (sparsity) TFLOPS: 242
L2 FP16 Tensor Core (sparsity) TFLOPS: 193
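For a sense of how much the cut-down parts retain of their full-fat counterparts, a simple comparison using only the figures listed above:

```python
# FP16 Tensor Core (sparsity) TFLOPS from the list above
specs = {"L40": 362, "L20": 239, "L4": 242, "L2": 193}

print(f"L20 retains {specs['L20'] / specs['L40']:.0%} of the L40's peak throughput")  # ~66%
print(f"L2 retains {specs['L2'] / specs['L4']:.0%} of the L4's peak throughput")      # ~80%
```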