
SRAM Alone: The Bandwidth King That Falls Short on Capacity for LLMs

  • Writer: Mohamed Abdelgawad
  • Dec 20, 2024
  • 4 min read

Updated: Jan 12

When it comes to running large language models like Llama-3-70B, one fact stands out: while on-chip SRAM excels at providing high bandwidth, relying solely on it is inefficient because of its limited capacity. Fetching 140 GB of weights and several GB of KV cache demands not just staggering bandwidth but also massive memory capacity, something that stacking more chips with a few hundred MB of SRAM each simply cannot solve efficiently, given the power and cost penalties. No matter how fast your claimed token generation speed (tokens/sec) is, your tokens/s/W/$ (tokens per second per watt per dollar) will remain uncompetitive.


In addition to on-chip SRAM (which was once enough to hold entire models back in the 2016 era), you need DRAM with enough capacity so you don’t have to drastically scale out just to run a single LLM (though scaling out will still be necessary, just less painfully). Current competitive solutions mainly rely on on-package HBM, the monstrous, expensive high-bandwidth memory that can stack a couple of hundred GB on-package and provide several TB/s of bandwidth. Examples include giants like Nvidia and AMD, and HBM is increasingly being adopted by startups for their next-generation chips.

The current-generation products of startups combine on-chip SRAM with off-package LPDDR or GDDR; examples include Tenstorrent Blackhole, Untether AI speedAI, and d-Matrix Corsair. Meanwhile, some designs combine on-chip SRAM, on-package HBM, and off-package DDR, such as SambaNova SN40L. I won’t claim to know the perfect balance between the different types of memory (if such a balance even exists); each memory type comes with its own trade-offs, ultimately shaping the product's market fit and determining where it can compete effectively. What I can do, however, is show the memory footprint of LLMs, which makes relying solely on SRAM an impractical path forward.


Unpacking Llama's Memory Appetite

Llama has emerged as the go-to architecture in open-source LLMs, significantly closing the performance gap with proprietary models and even outperforming some of them. In this article, I will use Llama 3 70B as my reference model. In the decode phase of Llama inference (the phase after the model has processed the prompt and started generating tokens), tokens are generated sequentially. For each generated token, the model must fetch all weight matrices and the KV cache. The memory footprint of the weights is straightforward to calculate: multiply the number of model parameters by the precision used for the weights.
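As a quick sanity check, here is that calculation for Llama 3 70B at 16-bit precision, a minimal sketch with the parameter count rounded to 70B (the source of the 140 GB figure used throughout this article):

```python
# Weights footprint = number of parameters x bytes per parameter
params = 70e9          # Llama 3 70B parameter count (rounded)
bytes_per_param = 2    # 16-bit precision (BF16/FP16)
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 140 GB
```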

The Key (K) and Value (V) vectors, on the other hand, are token-dependent intermediate results of the attention computation. Due to the iterative nature of the model, where previously generated tokens are used to generate the next one, the K and V vectors can be reused, hence 'caching' them saves compute resources during inference. The memory footprint of the KV cache grows linearly with the batch size (B), sequence length (L), number of key/value heads (n_kv_heads), dimension of each head (d_head), number of layers (n_layers), and precision. (Llama 3 employs Grouped Query Attention (GQA), where a group of query heads shares the same key and value heads, which reduces the total number of key/value heads.) To understand how K and V are generated, please refer to Figure 1, which depicts the first step of the attention computation. The K and V vectors are derived from linear projections of the token embeddings tensor with the weight matrices for K (Wk) and V (Wv). The token embeddings tensor has dimensions B x L x d_model, where d_model is the length of the token representation in the model, also known as the hidden size. When projected onto Wk and Wv, which have dimensions d_model x (n_kv_heads x d_head), the resulting K and V tensors have dimensions B x L x (n_kv_heads x d_head). K and V are unique per model layer, hence the n_layers factor in the computation of the KV cache size.

Figure 1. KV tensors computation in Llama 3. The K weight matrix (Wk) and V weight matrix (Wv) get projected onto the token embeddings tensor to compute the K and V tensors, i.e. the KV cache. B is the batch size, L is the sequence length, d_model is the hidden size, n_kv_heads is the number of KV heads, and d_head is the dimension of each head. [Image by writer]

For the first generated token, the sequence length equals the input prompt length. As the model continues generating tokens, the sequence length increases by one with each iteration. Consequently, the KV cache size grows dynamically during inference rather than remaining constant across iterations. Removing the sequence length from the formula gives us the KV cache size per token.
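Putting this together, here is a minimal sketch of the KV cache calculation (the leading factor of 2 accounts for storing both K and V; the variable names and helper function are mine):

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """KV cache = 2 (K and V) x B x L x n_layers x n_kv_heads x d_head x precision."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * d_head * bytes_per_value

# Llama 3 70B: 80 layers, 8 KV heads, head dimension 128, 16-bit values
print(kv_cache_bytes(1, 1, 80, 8, 128) / 1e3)     # per-token KV cache: ~328 KB
print(kv_cache_bytes(1, 4096, 80, 8, 128) / 1e9)  # B=1, L=4096: ~1.34 GB
```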

Looking at Meta's Llama 3 Herd of Models paper, the Llama 3 70B model has the following architecture parameters: 80 layers (n_layers), a hidden size (d_model) of 8192, 64 attention heads, 8 key/value heads (n_kv_heads), and a head dimension (d_head) of 128.

Using 16-bit (2-byte) precision (e.g. BF16, FP16), we get the following memory footprint of Llama 3 70B for different batch sizes and sequence lengths.

Sequence Length (prompt + output tokens) | Batch Size | KV Cache Size | Weights Size | Total Memory Footprint
4096 |  1 |   1.34 GB | 140 GB | 141.34 GB
4096 |  4 |   5.36 GB | 140 GB | 145.36 GB
4096 | 16 |  21.47 GB | 140 GB | 161.47 GB
4096 | 64 |   85.9 GB | 140 GB | 225.9 GB
8192 |  1 |   2.68 GB | 140 GB | 142.68 GB
8192 |  4 |  10.73 GB | 140 GB | 150.73 GB
8192 | 16 |  42.94 GB | 140 GB | 182.94 GB
8192 | 64 | 171.79 GB | 140 GB | 311.79 GB
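For reference, a short sketch that reproduces the table above (values may differ from the table in the last decimal place due to rounding; GB here means 10^9 bytes):

```python
N_LAYERS, N_KV_HEADS, D_HEAD, BYTES = 80, 8, 128, 2  # Llama 3 70B, 16-bit precision
WEIGHTS_GB = 70e9 * BYTES / 1e9                      # 140 GB of weights

for seq_len in (4096, 8192):
    for batch in (1, 4, 16, 64):
        kv_gb = 2 * batch * seq_len * N_LAYERS * N_KV_HEADS * D_HEAD * BYTES / 1e9
        print(f"L={seq_len} B={batch:>2}  KV={kv_gb:7.2f} GB  total={WEIGHTS_GB + kv_gb:7.2f} GB")
```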

Looking at the KV cache size, on-chip SRAM can be effective for storing only the KV cache in scenarios with very short sequence lengths and minimal batching (i.e., a limited number of concurrent users). However, in most practical cases, relying solely on on-chip SRAM forces you to scale out drastically, potentially requiring hundreds of chips and several racks, just to accommodate a single model with a reasonable sequence length and batch size. Even if you decide to dedicate more area on your chip to increase SRAM at the expense of compute resources, it’s unlikely to scale efficiently. For example, Groq, relying only on on-chip SRAM, uses around 640 chips to run the Llama-3-70B model, which makes achieving a competitive tokens/s/W/$ challenging.
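As a rough back-of-the-envelope check, here is a sketch assuming roughly 230 MB of on-chip SRAM per chip (the figure reported for Groq's LPU; real deployments need even more chips once pipelining and replication are accounted for):

```python
model_gb = 141.34         # weights + KV cache for B=1, L=4096 (from the table above)
sram_per_chip_gb = 0.230  # assumed ~230 MB of on-chip SRAM per chip
print(round(model_gb / sram_per_chip_gb))  # ~615 chips just to hold the model once
```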


Final Thoughts

The memory capacity and bandwidth requirements of large language models like Llama-3-70B have fundamentally reshaped the design landscape for AI hardware. Some startups continue to repurpose solutions designed for other models, while others are innovating a family of products tailored to efficiently meet the diverse needs of different models. The future of AI hardware will belong to those who create products that strike the perfect balance between bandwidth, capacity, power, and cost, all while scaling effectively for real-world workloads.
