Errata: A Close Look at SRAM for Inference in the Age of HBM Supremacy
An update on SRAM bandwidth calculations.
The original post contained an error in our SRAM bandwidth calculation, which used the NVIDIA B200 as an example. The erroneous passage is reproduced below.
Each stack of HBM3e provides ~1 TB/s. With 8 stacks, that’s 8 TB/s of peak DRAM bandwidth.
At the GPU base clock of 700MHz, on-die SRAM effectively provides ~90 TB/s. This can be derived as follows:
Cache line size, i.e., the width of SRAM = 128 bytes
Clock speed = 700MHz, and the SRAM can be accessed on every clock cycle.
(128-byte cache line × 700 million accesses per second ≈ ~90 TB/s.)
This, in fact, comes out to roughly 90 GB/s, not 90 TB/s. The calculation missed two things:
The GPU clock frequency is actually 1.98 GHz, not 700 MHz.
The number of streaming multiprocessors in the GPU (148) was not included.
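As a quick sanity check, here is the original arithmetic as a back-of-the-envelope Python sketch, under the same assumption the original text made of one 128-byte access per clock cycle from a single port:

```python
# Back-of-the-envelope check of the original (incorrect) figure:
# one 128-byte cache line per clock cycle at 700 MHz, from a single port.
cache_line_bytes = 128
clock_hz = 700e6  # 700 MHz

bandwidth_bytes_per_s = cache_line_bytes * clock_hz
print(f"{bandwidth_bytes_per_s / 1e9:.1f} GB/s")  # ~89.6 GB/s, i.e. ~90 GB/s, not 90 TB/s
```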
Our special thanks to the reader who pointed this out to us.
As much as we strive for accuracy, things occasionally slip by, and we want to correct them as soon as possible. The section below has been rewritten with the correct numbers and additional context, and the original article has been updated accordingly.
Correction
1. SRAM bandwidth is an order of magnitude higher than HBM
Take the NVIDIA B200 as an example:
Each stack of HBM3e provides ~1 TB/s. With 8 stacks, that’s 8 TB/s of peak DRAM bandwidth.
At the GPU clock of 1.98 GHz, the on-die SRAM-based L1 cache effectively provides ~37.5 TB/s of aggregate bandwidth. This can be derived as follows:
Cache line size, i.e., the width of SRAM = 128 bytes
Clock speed = 1.98 GHz, and the SRAM can be accessed on every clock cycle.
148 Streaming Multiprocessors (SMs)
(128-byte cache line × 1.98 GHz × 148 SMs ≈ 37.5 TB/s.)
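The corrected arithmetic, as a minimal Python sketch under the same assumption of one 128-byte L1 access per SM per clock cycle:

```python
# Corrected back-of-the-envelope: one 128-byte L1 access per SM per clock,
# at a 1.98 GHz clock, across all 148 SMs on the B200.
cache_line_bytes = 128
clock_hz = 1.98e9   # 1.98 GHz
num_sms = 148

l1_bandwidth = cache_line_bytes * clock_hz * num_sms   # bytes per second
hbm_bandwidth = 8e12                                   # 8 TB/s peak HBM3e (8 stacks x ~1 TB/s)

print(f"Aggregate L1 SRAM bandwidth: {l1_bandwidth / 1e12:.1f} TB/s")   # ~37.5 TB/s
print(f"Peak HBM3e bandwidth:        {hbm_bandwidth / 1e12:.1f} TB/s")  # 8.0 TB/s
```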
In addition to the existing cache hierarchy, Blackwell’s fifth-generation Tensor Core introduces TMEM, a dedicated 256 KB SRAM per SM designed specifically to serve Tensor Core operations. Microbenchmarking results indicate that it “provides 16 TB/s read bandwidth and 8 TB/s write bandwidth per SM, and this bandwidth operates additively with L1/SMEM bandwidth rather than competing for the same resources.” [source]. The introduction of TMEM highlights the performance potential unlocked by highly specialized, on-chip SRAM structures tightly coupled to compute.
Looking beyond GPUs to SRAM-based accelerators, the Groq LPU delivers approximately 80 TB/s of on-chip memory bandwidth, while d-Matrix’s Corsair PCIe card is specified to provide up to 150 TB/s of memory bandwidth.
The online store will also be updated with the corrected version of the file. If you have already purchased the article, we will email you the updated files individually.
Paid subscribers will find an updated epub and pdf of the post, with the corrections included, after the paywall below.
Thank you for all your support and have a nice day!


