SemiAnalysis Massive Teardown: Full Blackwell Architecture Details, NVIDIA's Never-Before-Revealed Secrets

SemiAnalysis has published the first detailed teardown of NVIDIA's Blackwell architecture. Under AI workloads, Tensor Core throughput and memory bandwidth come close to their theoretical peaks, but performance depends heavily on instruction shape and software tuning. 2SM MMA scales almost perfectly, while SMEM bandwidth and a cross-die latency penalty of approximately 300 cycles emerge as the key bottlenecks. The takeaway: how much of Blackwell's performance is actually realized depends less on hardware limits than on scheduling and optimization.

NVIDIA's Blackwell GPU represents one of the most significant GPU microarchitectural changes in recent years, yet NVIDIA has so far not published a detailed official whitepaper.

SemiAnalysis, a renowned semiconductor research institution, has spent months conducting systematic micro-benchmarking of the Blackwell architecture, releasing for the first time hardware performance ceiling data under AI workloads.

The test results show that Blackwell approaches theoretical peak values in key dimensions such as Tensor Core throughput, memory subsystem bandwidth, and the new 2SM MMA instructions. However, performance depends heavily on the instruction shape chosen, and some configurations hit noticeable bandwidth bottlenecks. This finding has direct implications for AI infrastructure investors and chip purchasers: unlocking the architecture's potential hinges on fine-tuning at the software level.

SemiAnalysis has open-sourced the relevant benchmark codebase, with B200 nodes provided by Nebius and Verda for testing. The research team also announced plans to expand benchmarking to TPU Pallas kernels, Trainium NKI kernels, and AMD CDNA4 assembly.

Core Architectural Changes: Introduction of TMEM and 2SM MMA

From Hopper to Blackwell, NVIDIA has made several significant adjustments to the PTX abstraction layer for MMA-related instructions.

The most notable change is the introduction of Tensor Memory (TMEM) for storing MMA accumulators. In previous architectures, threads implicitly held the results of MMA operations; Blackwell shifts to explicit management of TMEM by software within the MMA scope, altering the ownership relationship between threads and computation results.

Concurrently, the tcgen05 operation is now issued by a single thread representing the entire CTA (Cooperative Thread Array), rather than by a warp or warpgroup as in the previous Hopper architecture. This change is directly reflected in the CuTe MMA atoms: Blackwell uses ThrID = Layout<_1>, while Hopper uses ThrID = Layout<_128>.

Blackwell also introduces TPC-scoped TMA and MMA, supporting two collaborative CTAs across SMs to execute tcgen05.mma and share operands. This reduces the shared memory bandwidth requirements per CTA while offering MMA instructions with higher computational intensity. Additionally, the architecture natively supports sub-byte data types with micro-scaling and introduces Cluster Launch Control (CLC) as hardware support for dynamic work scheduling in persistent CTA kernels.

Chip Physical Layout: Dual-Die Architecture and 300-Cycle Cross-Die Latency

SemiAnalysis has revealed the physical topology of the B200 chip through reverse engineering.

By reading the PTX %smid special register and launching clusters of varying sizes, the research team inferred the mapping between SMs and GPCs (Graphics Processing Clusters). The results show that some GPCs on B200 contain dedicated TPCs that are never co-scheduled with other TPCs.

By having each SM fill and traverse pointer-chasing arrays that saturate the L2 cache, and then measuring access latency between SMs, the research team constructed an inter-SM distance matrix. The matrix clearly shows two groups of SMs whose average L2 access latency differs by more than 300 clock cycles, which corresponds to the cross-die access penalty between the two dies.
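To illustrate what a pointer-chasing array is, here is a minimal host-side sketch in Python. It is not SemiAnalysis's benchmark code: the function names are hypothetical, and on a GPU the chase loop would run inside a kernel and be timed with a cycle counter. The point is that each load depends on the result of the previous one, so the hardware cannot overlap accesses, and total time divided by the number of hops approximates the per-access latency. Sattolo's algorithm is used here because it guarantees a single cycle through the whole array.

```python
import random

def build_pointer_chase(n, seed=0):
    """Return a list nxt where nxt[i] is the index to visit after i.

    Sattolo's algorithm produces a permutation with exactly one cycle,
    so a chase starting anywhere touches all n slots before repeating.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself: this forces a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, start=0):
    """Follow the chain until it returns to start; return the visit order."""
    order = [start]
    idx = nxt[start]
    while idx != start:
        order.append(idx)
        idx = nxt[idx]  # the next address is only known after this load
    return order

nxt = build_pointer_chase(1024)
order = chase(nxt)
print(len(order), len(set(order)))  # every slot is visited exactly once
```

Sizing the array so that the chain's footprint fills the cache level under test is what directs the measured latency at that level rather than at a smaller cache in front of it.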

Based on this, the research team infers the die-level TPC distribution for B200 as follows:

Die A: the four GPCs contain 10, 10, 10, and 9 TPCs, respectively
Die B: the four GPCs contain 9, 9, 9, and 5+3 TPCs, respectively

This difference in physical layout implies that even two GPUs with identical logical configurations might have different physical SM distributions, creating a potential source of performance non-determinism.
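As a quick sanity check on the inferred distribution above, the TPC counts can be tallied. The sketch below assumes two SMs per TPC, which the article does not state explicitly, and treats the "5+3" entry as two separate groups. Under that assumption the total comes to 148 SMs, which matches the SM count B200 exposes.

```python
# TPC counts per GPC as listed in the text; "5+3" is kept as two entries.
DIE_A = [10, 10, 10, 9]
DIE_B = [9, 9, 9, 5, 3]
SMS_PER_TPC = 2  # assumption: two SMs per TPC

def sm_count(tpcs_per_gpc, sms_per_tpc=SMS_PER_TPC):
    """Number of SMs implied by a list of per-GPC TPC counts."""
    return sum(tpcs_per_gpc) * sms_per_tpc

total_sms = sm_count(DIE_A) + sm_count(DIE_B)
print(sum(DIE_A), sum(DIE_B), total_sms)  # 39 35 148
```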

Memory Subsystem: Performance Boundaries of LDGSTS and TMA

Memory subsystem testing focused on two asynchronous copy mechanisms: LDGSTS (the SASS instruction behind cp.async asynchronous copies) and TMA (Tensor Memory Accelerator).

Regarding LDGSTS, testing covered typical configurations of FlashInfer's Multi-Head Attention (MHA) kernels. Results showed that LDGSTS memory throughput saturates at 32 KiB of in-flight bytes, reaching a peak of approximately 6.6 TB/s. 16-byte loads performed slightly better than 8-byte loads with the same in-flight bytes and consumed fewer execution resources. Latency tests indicated a baseline LDGSTS latency of around 600 nanoseconds. Latency nearly doubled when in-flight bytes exceeded 8 KiB, attributed to numerous threads stalling due to MIO (Memory Input/Output) throttling.
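The relationship between throughput, latency, and in-flight bytes can be sketched with Little's law (bytes in flight = bandwidth × latency). The calculation below is a back-of-the-envelope estimate, not a result from the benchmark: it assumes the in-flight figures are counted per SM and that 148 SMs are active, neither of which the text states. Under those assumptions, sustaining the reported peak at the reported baseline latency needs in-flight data per SM on the same order as the 32 KiB saturation point.

```python
def required_in_flight_bytes(bandwidth_bytes_per_s, latency_s):
    """Little's law: bytes that must be outstanding to sustain a bandwidth."""
    return bandwidth_bytes_per_s * latency_s

PEAK_BW = 6.6e12        # bytes/s, peak LDGSTS throughput reported in the text
BASE_LATENCY = 600e-9   # seconds, baseline LDGSTS latency reported in the text
NUM_SMS = 148           # assumption: all SMs issue loads concurrently

total_bytes = required_in_flight_bytes(PEAK_BW, BASE_LATENCY)
per_sm_kib = total_bytes / NUM_SMS / 1024
print(f"{total_bytes / 1e6:.2f} MB total, {per_sm_kib:.1f} KiB per SM")
```

Because latency rises under load, the in-flight bytes needed at saturation would be higher than this baseline estimate.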

Regarding TMA, peak throughput was reached significantly later than with LDGSTS. Below 32 bytes of in-flight data, LDGSTS throughput slightly outperformed TMA; above this threshold, TMA caught up and continued to scale up to 128 KiB. In terms of latency, LDGSTS was slightly faster below 12 KiB of in-flight data, and TMA latency increased sharply beyond that point.

TMA multicast tests showed that explicit TMA multicast removed the redundant L2 traffic entirely, achieving the ideal L2 byte ratio of 1/(cluster size). Implicit multicast (where each CTA independently issues TMA loads for the same data) achieved effective memory throughput comparable to explicit multicast, but its L2 traffic savings began to shrink once in-flight data exceeded 64 bytes.

Tensor Core Performance: Significant Shape Dependency, Perfect Weak Scaling with 2SM MMA

Tensor Core testing was a central part of this research, with results revealing Blackwell MMA's high sensitivity to instruction shape.

In terms of throughput, for 1SM MMA, the M=64 configuration reached only 50% of the theoretical peak, whereas M=128 approached 100%. This confirms that M=64 utilized only half the data paths. For 2SM MMA, M=128 at N=64 achieved 90% of peak throughput, with other N dimensions approaching 100%. M=256 maintained nearly 100% peak throughput across all configurations because M=256 is equivalent to processing M=128 per SM, fully utilizing the complete data paths.

The layout of the A and B operands also had a significant impact. When both input matrices are stored in shared memory (SS mode), M=128 exhibits a clear SMEM bandwidth bottleneck for N<128. Taking FP16 as an example, the hardware can execute 8192 MMA FLOPs per cycle, with SMEM bandwidth at 128 B/cycle. For the M=128, N=64, K=16 configuration, reading A (128×16 elements) and B (64×16 elements) at 2 bytes per element moves 6144 bytes, which takes 48 cycles of SMEM bandwidth, while the 2×128×64×16 = 262,144 FLOPs need only 32 cycles of math, so the instruction is constrained by SMEM bandwidth. This pattern holds for all data types: MMA instructions where both operands reside in SMEM are constrained by SMEM bandwidth for N<128.
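The SMEM-versus-math comparison above can be reproduced with a small calculator. This is a simplified sketch using only the two figures quoted in the text (8192 FP16 FLOPs per cycle and 128 B per cycle of SMEM bandwidth); the function names are illustrative, and the model ignores any operand reuse or other hardware effects.

```python
FP16_FLOPS_PER_CYCLE = 8192  # from the text
SMEM_BYTES_PER_CYCLE = 128   # from the text

def math_cycles(m, n, k, flops_per_cycle=FP16_FLOPS_PER_CYCLE):
    """Cycles to execute one MxNxK MMA (2 FLOPs per multiply-accumulate)."""
    return 2 * m * n * k / flops_per_cycle

def smem_cycles(m, n, k, bytes_per_elem=2, smem_bw=SMEM_BYTES_PER_CYCLE):
    """Cycles to read A (MxK) and B (NxK) from shared memory in SS mode."""
    return (m * k + n * k) * bytes_per_elem / smem_bw

for n in (64, 128, 256):
    s, c = smem_cycles(128, n, 16), math_cycles(128, n, 16)
    verdict = "SMEM-bound" if s > c else "not SMEM-bound"
    print(f"N={n}: smem={s:.0f} cycles, math={c:.0f} cycles, {verdict}")
```

For M=128 the two costs balance exactly at N=128 (64 cycles each), which is consistent with the N<128 threshold described above.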

2SM MMA achieved perfect weak scaling, delivering a 2x speedup over 1SM MMA when given twice the compute resources. In small-shape SS-mode configurations, speedups above 2x were even observed, because operand B is sharded between the two SMs. The conclusion is clear: always use the largest instruction shape available for a given SMEM tile size to achieve maximum throughput.
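One way to see why sharding B can push the speedup past 2x is a simple roofline-style model. The sketch below is an illustration under several assumptions that do not come from the benchmark: FP16 at 8192 FLOPs per cycle and 128 B per cycle of SMEM per SM, perfect overlap of SMEM reads and math (cost = the larger of the two), each SM in a 2SM MMA reading its own M/2×K slice of A plus only an N/2×K half of B, and a comparison of 1SM M=128 against 2SM M=256 at the same N. The numbers it prints are model outputs, not measurements.

```python
FLOPS_PER_CYCLE = 8192  # FP16, per SM (figure quoted in the text)
SMEM_BW = 128           # bytes per cycle, per SM (figure quoted in the text)
BYTES_PER_ELEM = 2      # FP16

def cycles_1sm(m, n, k):
    """Modeled cycles for a 1SM SS-mode MMA: max of math and SMEM time."""
    math = 2 * m * n * k / FLOPS_PER_CYCLE
    smem = (m + n) * k * BYTES_PER_ELEM / SMEM_BW
    return max(math, smem)

def cycles_2sm(m, n, k):
    """Modeled per-SM cycles for a 2SM MMA.

    Assumption: each SM computes M/2 rows and reads its M/2 x K slice
    of A plus an N/2 x K half of B from its own shared memory.
    """
    math = 2 * (m // 2) * n * k / FLOPS_PER_CYCLE
    smem = (m // 2 + n // 2) * k * BYTES_PER_ELEM / SMEM_BW
    return max(math, smem)

def modeled_speedup(n, k=16):
    """2SM M=256 does twice the FLOPs of 1SM M=128 at the same N."""
    return 2 * cycles_1sm(128, n, k) / cycles_2sm(256, n, k)

for n in (64, 128, 256):
    print(f"N={n}: modeled 2SM speedup = {modeled_speedup(n):.2f}x")
```

In this model the speedup is 2.4x at N=64, where the 1SM instruction is SMEM-bound, and exactly 2x once N reaches 128, where both variants are math-bound.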

Latency increased linearly with N from 64 to 128 across all configurations, with a jump at N=256. Ordered by data type, latency followed the pattern S8 < BF16 = E4M3 = F4 < MXF8 = MXF4. The research team posits that S8 is fastest because integer operations are more power-efficient, while the scaling-factor calculations for micro-scaled data types add a slight overhead.

Tests with realistic in-flight instruction counts covered the range of 1 to 4 in-flight MMA instructions typical of real kernels. With 4 in-flight MMA instructions, throughput topped out at around 78% to 80% of the theoretical peak, with 1SM MMA holding an advantage of about 5 percentage points over 2SM MMA.