A Practical Guide to LLM GPU Memory Estimation
If you’ve ever launched a training run and immediately hit an OOM, this post is for you. GPU memory estimation for LLM training isn’t black magic, but it does involve tracking several interacting variables. Most online guides either hand-wave the details or dump a single formula without context. I want to do better here – walk through each component, show where the bytes actually go, and then ground everything in a real example with Qwen-2.5-7B on H100s.
The Memory Equation – What Eats Your VRAM
At a high level, GPU memory during training breaks down into five buckets:
Memory ≈ Model_Params + Gradients + Optimizer_States + Activations + Overhead
Each of these scales differently, and each responds differently to the knobs you can turn (precision, batch size, parallelism). Let’s go through them one by one.
Model Parameters. The simplest component. Total parameter count times bytes per parameter. A 7B model in bf16 is 7 billion * 2 bytes = 14 GB. In fp32 it would be 28 GB. The architecture (number of layers, hidden dimension, FFN width, attention heads) determines the total count, but for estimation purposes, the headline number is all you need.
Gradients. During backprop, every parameter gets a gradient tensor of the same shape and precision. So gradients cost the same as model parameters: params * bytes_per_param.
Optimizer States. This is where people get surprised. Covered in detail below.
Activations. The most complex and hardest to predict. Depends on batch size, sequence length, hidden dimension, and number of layers. Also covered below.
Overhead. CUDA context alone eats several hundred MB. Framework internals (PyTorch autograd graph, communication buffers, temporary tensors) add more. Rule of thumb: budget 10-20% on top of everything else.
Precision Formats and Why They Matter
The bytes-per-parameter choice cascades through almost everything:
| Format | Bytes/Param | Typical Use |
|---|---|---|
| float32 | 4 | Legacy training, some optimizer internals |
| float16 / bfloat16 | 2 | Standard mixed-precision training |
| int8 | 1 | Quantized inference, QLoRA |
| int4 | 0.5 | Aggressive quantization |
Moving from fp32 to bf16 halves the memory for parameters, gradients, and activations. That’s a massive win, and it’s why mixed-precision training is now the default for anything above a few billion parameters.
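That halving is easy to sanity-check in a few lines. A minimal sketch (the helper name and the 1 GB = 1e9 bytes convention are mine, matching the 7B * 2 bytes = 14 GB arithmetic used throughout this post):

```python
# Bytes per parameter for each format in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(param_billions, fmt):
    """Weight memory in GB, using 1 GB = 1e9 bytes as in the text."""
    return param_billions * BYTES_PER_PARAM[fmt]

print(weights_gb(7, "fp32"))  # 28.0
print(weights_gb(7, "bf16"))  # 14.0 -- the halving described above
print(weights_gb(7, "int4"))  # 3.5
```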
But there’s a catch: not everything can move to lower precision. Optimizer states are the big exception.
The Optimizer State Trap
Adam and AdamW maintain two additional tensors per parameter: the first moment (momentum) and the second moment (variance). Each is stored in fp32, giving you 4 + 4 = 8 bytes per parameter just for optimizer state.
These stay in fp32 regardless of your model precision, and in standard mixed-precision recipes that's non-negotiable: stored in fp16, the small update magnitudes get rounded to zero and training diverges. Every serious training framework keeps them in full precision by default (quantized-optimizer variants like 8-bit Adam exist, and come up later in the OOM playbook).
Here’s what this means concretely for a 7B model:
| Component | Bytes/Param | Total (7B) |
|---|---|---|
| Params (bf16) | 2 | 14 GB |
| Gradients (bf16) | 2 | 14 GB |
| Adam states (fp32) | 8 | 56 GB |
The optimizer states alone are 4x the model size in bf16. For a 7B model, Adam needs 56 GB just for its moment estimates. This is often the single largest memory consumer, and it’s the main reason you can’t just “train in fp16” and expect everything to fit.
For comparison, SGD with momentum only needs 4 bytes/param (one momentum buffer), and vanilla SGD with no momentum needs zero extra bytes. But nobody trains LLMs with vanilla SGD, and for good reason.
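The per-optimizer bookkeeping above can be sketched as follows. The function name and byte counts are illustrative, assuming bf16 params/grads and fp32 moments as described:

```python
# Training-state memory under the three optimizer choices discussed above.
def training_state_gb(param_billions, optimizer="adamw"):
    opt_bytes = {"adamw": 8, "sgd_momentum": 4, "sgd": 0}[optimizer]
    params_gb = param_billions * 2       # bf16 weights
    grads_gb = param_billions * 2        # bf16 gradients
    opt_gb = param_billions * opt_bytes  # fp32 moment buffers
    return params_gb, grads_gb, opt_gb

print(training_state_gb(7, "adamw"))  # (14, 14, 56): moments are 4x the bf16 weights
print(training_state_gb(7, "sgd"))    # (14, 14, 0)
```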
Activation Memory – The Hard Part
Activations are the intermediate tensors saved during the forward pass for use during backprop. Unlike parameters and optimizer states, activation memory scales with your runtime configuration – batch size and sequence length in particular.
A rough approximation:
Activation_Memory ≈ k * batch_size * seq_len * hidden_size * num_layers * bytes_per_activation
The constant k is empirical and architecture-dependent. For standard transformers it typically falls in the range of 2-10+, depending on how many intermediate tensors each layer saves. Attention layers are particularly expensive because they materialize the full attention matrix (batch * heads * seq_len * seq_len) unless you’re using FlashAttention or a similar memory-efficient variant.
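The approximation translates directly to code. Treat everything here as a sketch: k=4 is an arbitrary pick from the empirical range, and the layer shape (hidden=4096, 32 layers) is an illustrative 7B-ish placeholder, not any specific model's config:

```python
def activation_gb(batch, seq_len, hidden, layers, k=4, bytes_per_act=2):
    # Direct implementation of the rough formula above; k is empirical (2-10+).
    return k * batch * seq_len * hidden * layers * bytes_per_act / 1e9

print(activation_gb(1, 4096, 4096, 32))  # ~4.3 GB for one sequence
print(activation_gb(4, 4096, 4096, 32))  # scales linearly with batch size
```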
This is the least precise part of any memory estimate. The exact number depends on:
- Whether you use activation checkpointing (recomputation)
- Which layers you checkpoint and how many
- Whether attention is memory-efficient (FlashAttention, etc.)
- Framework-specific implementation details
Activation checkpointing trades compute for memory: instead of saving all intermediates, you discard them and recompute during backward. Checkpointing every layer can cut activation memory by 50-70%, at the cost of roughly 30% more compute. Most frameworks let you specify exactly which layers to checkpoint, so you can tune the trade-off.
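To make the trade concrete, here's a back-of-envelope helper. The 60% savings and 30% compute overhead defaults are just points inside the ranges quoted above, not measured values:

```python
def checkpoint_trade(act_gb, mem_savings=0.6, compute_overhead=0.3):
    # Checkpointing discards intermediates and recomputes them in backward:
    # less stored activation memory, more FLOPs.
    return {
        "stored_activations_gb": round(act_gb * (1 - mem_savings), 2),
        "relative_compute": 1 + compute_overhead,
    }

print(checkpoint_trade(10.0))  # {'stored_activations_gb': 4.0, 'relative_compute': 1.3}
```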
How Parallelism Strategies Distribute Memory
Once your model doesn’t fit on one GPU (or you need larger batch sizes), you need to distribute the work. Each parallelism strategy has different memory implications.
Data Parallelism (DP)
Every GPU holds a full copy of the model, optimizer states, and gradients. Each GPU processes a different data shard. Gradients are all-reduced after each step.
Memory impact: essentially none. Each GPU still needs the full model. DP reduces wall-clock time by processing more data in parallel, but it doesn’t help per-GPU memory. This is why pure DP tops out quickly for large models.
Tensor Parallelism (TP)
Weight matrices are split across GPUs within each layer. Each GPU holds a shard of the parameters, gradients, and optimizer states.
Memory_per_GPU ≈ (Params + Gradients + Optimizer_States) / TP_size + partial_activations
TP is very effective at reducing memory, but it requires fast interconnect (NVLink or equivalent) because every forward and backward pass involves all-reduce communication within each layer.
Pipeline Parallelism (PP)
The model is split by layer groups across GPUs. GPU 0 gets layers 0-7, GPU 1 gets layers 8-15, and so on. Each GPU only stores its own layers’ parameters, optimizer states, gradients, and activations.
The trade-off is pipeline bubbles – GPUs sitting idle while waiting for activations from the previous stage. Micro-batching helps fill the pipeline, but some bubble overhead is unavoidable.
Expert Parallelism (EP)
Specific to Mixture-of-Experts models. Experts are distributed across GPUs, so each GPU only stores a subset of the total experts. The routing layers and shared (non-expert) layers are typically replicated. Requires all-to-all communication for token dispatch.
Combined (3D Parallelism)
In practice, large-scale training combines TP, PP, and DP together. The ideal memory per GPU for parameters, gradients, and optimizer states:
Memory_per_GPU ≈ Total / (TP * PP)
DP doesn’t reduce per-GPU model memory. EP applies separately to expert layers. Communication buffers and activation overlap mean reality is always somewhat worse than the ideal formula suggests.
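The ideal-sharding math is one line of arithmetic. A sketch, with DP deliberately absent from the denominator (real frameworks keep some replicated buffers, so treat this as a floor):

```python
def per_gpu_state_gb(param_billions, tp=1, pp=1, param_bytes=2, opt_bytes=8):
    # Params + grads (param_bytes each) + fp32 Adam moments, ideally
    # sharded across TP * PP. DP replicates this state, so it doesn't help.
    total = param_billions * (2 * param_bytes + opt_bytes)
    return total / (tp * pp)

print(per_gpu_state_gb(7))              # 84.0 -- no chance on one 80 GB card
print(per_gpu_state_gb(7, tp=4))        # 21.0 per GPU, before activations/overhead
print(per_gpu_state_gb(7, tp=4, pp=2))  # 10.5 -- 84 GB spread over 8 model shards
```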
Worked Example: Qwen-2.5-7B on H100
Let’s put real numbers on this. I’ll walk through a memory breakdown for training Qwen-2.5-7B in two configurations: an 8-card H100 setup that barely fits, and a 4-card optimized setup that runs comfortably.
Model specs: 7B parameters, 28 transformer layers, bf16 training, AdamW optimizer.
8-Card Configuration (Original)
TP=4, seq_len=8192, rollout-batch-size=32, n-samples-per-prompt=8:
| Component | Per-GPU Memory | Notes |
|---|---|---|
| Model parameters | 14.0 GB | bf16, 7B * 2 bytes |
| Gradients | 5.6 GB | bf16, partially sharded |
| Optimizer states | 22.4 GB | Adam fp32, partially sharded |
| KV cache | 24.5 GB | seq=8192, 28 layers |
| Activations | 9.8 GB | micro-batch=1, recompute 1 layer |
| Framework overhead | 3.7 GB | SGLang + Ray + PyTorch runtime |
| Total | ~80 GB | Right at the H100 80 GB limit |
This works, but it’s skating on the edge. Any spike in activation memory or a slightly longer sequence pushes you into OOM territory. Not a configuration you want to leave running overnight.
4-Card Optimized Configuration
To fit on fewer GPUs with real safety margin, I made four targeted changes:
```
--rollout-batch-size 16          # was 32 (halved)
--n-samples-per-prompt 4         # was 8 (halved)
--rollout-max-response-len 4096  # was 8192 (halved)
--recompute-num-layers 2         # was 1 (trade compute for memory)
```
Result: ~52 GB per GPU, a 35% reduction with comfortable headroom.
| Component | Per-GPU Memory |
|---|---|
| Model parameters | 14.0 GB |
| Gradients | 3.5 GB |
| Optimizer states | 14.0 GB |
| KV cache | ~12.0 GB |
| Activations | ~7.5 GB |
| Framework overhead | 3.7 GB |
| Total | ~52 GB |
What Gave the Biggest Win
The KV cache reduction was the single largest contributor. KV cache memory follows this formula:
KV_mem = 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element
Halving the sequence length from 8192 to 4096 cuts KV cache by 50%. Since KV cache was the largest single component at ~24.5 GB in the original config, this alone saved over 12 GB per GPU. Everything in the formula is multiplicative, so cutting multiple factors compounds quickly.
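Plugging numbers into the formula is straightforward. Note two assumptions of mine here: hidden_size=3584 is Qwen-2.5-7B's published hidden dimension, and with grouped-query attention the real cache is smaller than this full-hidden-size bound, so treat the output as a ceiling:

```python
def kv_cache_gb(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    # 2x for K and V, per the formula above; bf16 elements by default.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem / 1e9

per_seq_8k = kv_cache_gb(28, 3584, 8192, 1)
per_seq_4k = kv_cache_gb(28, 3584, 4096, 1)
print(round(per_seq_8k, 2), round(per_seq_4k, 2))  # 3.29 1.64 (GB per sequence)
print(per_seq_4k / per_seq_8k)  # 0.5 -- halving seq_len halves the cache
```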
The Trade-offs
Nothing is free. Here’s what the optimized config gives up:
| Metric | 8-Card Original | 4-Card Optimized |
|---|---|---|
| Peak memory / GPU | ~80 GB | ~52 GB |
| Throughput (relative) | 100% | 65-70% |
| Max sequence length | 8192 | 4096 |
| Training stability | Critical (near limit) | Safe (28 GB headroom) |
You lose 30-35% throughput and halve your max sequence length. Whether that’s acceptable depends on your task. For many fine-tuning and RLHF jobs, 4096 tokens is plenty.
Practical Tips and OOM Emergency Playbook
When you’re hitting OOM during training, here’s what to try, ordered by impact and minimal disruption to training dynamics:
1. Cut sequence length. Reduce max-response-len or max-seq-len. Going from 8192 to 3072 can free 10+ GB. Both KV cache and attention activations scale linearly (or worse) with sequence length, making this the single biggest lever you can pull.
2. Reduce micro-batch size, increase gradient accumulation. Same effective batch size, much less activation memory per step. The trade-off is more sequential steps per parameter update, which increases wall-clock time.
3. Reduce samples per prompt. If you’re doing RL-style training (GRPO, PPO), cutting n-samples-per-prompt directly reduces the number of rollouts held in memory simultaneously.
4. Increase activation checkpointing. Recompute more layers during backward. This is a pure compute-for-memory trade: more FLOPs, less stored activations. Most frameworks let you specify exactly how many layers to checkpoint.
5. Switch optimizer. If you’re truly desperate, 8-bit Adam (from bitsandbytes) or Adafactor can cut optimizer state memory significantly. But test carefully – optimizer changes can affect convergence, and you don’t want to discover that 200 hours into a run.
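Tip 2 is worth spelling out, since micro-batch and effective batch are easy to conflate. A sketch of the bookkeeping (the function name is mine):

```python
def accumulation_schedule(effective_batch, micro_batch):
    # Activation memory scales with micro_batch; gradient accumulation keeps
    # the effective batch (and thus training dynamics) unchanged.
    assert effective_batch % micro_batch == 0, "micro_batch must divide effective batch"
    return effective_batch // micro_batch

# Dropping micro-batch 32 -> 4 cuts per-step activation memory ~8x,
# at the cost of 8 sequential forward/backward passes per optimizer step.
print(accumulation_schedule(32, 4))  # 8
```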
Quick Estimation Shortcut
For a back-of-envelope check before you even start writing configs:
```python
def estimate_training_memory_gb(param_billions, precision_bytes=2, optimizer="adam"):
    params_gb = param_billions * precision_bytes
    grads_gb = param_billions * precision_bytes
    opt_bytes = {"sgd": 4, "adam": 8}[optimizer]
    opt_gb = param_billions * opt_bytes
    # Activations: rough lower bound at 1.5x param size
    act_gb = params_gb * 1.5
    subtotal = params_gb + grads_gb + opt_gb + act_gb
    return subtotal * 1.15  # 15% overhead buffer

# 7B model, bf16, Adam: 14 + 14 + 56 + 21 = 105 GB, plus 15% overhead
print(f"{estimate_training_memory_gb(7):.1f} GB")  # ~120.7 GB
```
This won’t be precise for any specific configuration, but it tells you whether you’re in the right ballpark before you spend an hour writing launch scripts.
The bottom line: GPU memory estimation comes down to tracking all the components and knowing which ones dominate. Optimizer states are usually the surprise. Activations are the wildcard. And when things don’t fit, cut sequence length first – it’s almost always the highest-impact, lowest-disruption fix.