A Practical Guide to LLM GPU Memory Estimation
If you’ve ever launched a training run and immediately hit an OOM, this post is for you. GPU memory estimation for LLM training isn’t black magic, but it does involve tracking several interacting variables. Most online guides either hand-wave the details or dump a single formula without context. I want to do better here – walk through each component, show where the bytes actually go, and then ground everything in a real example with Qwen-2.5-7B on H100s.
The Memory Equation – What Eats Your VRAM
At a high level, GPU memory during training breaks down into five buckets:
Memory ≈ Model_Params + Gradients + Optimizer_States + Activations + Overhead
Each of these scales differently, and each responds differently to the knobs you can turn (precision, batch size, parallelism). Let’s go through them one by one.
Model Parameters. The simplest component. Total parameter count times bytes per parameter. A 7B model in bf16 is 7 billion * 2 bytes = 14 GB. In fp32 it would be 28 GB. The architecture (number of layers, hidden dimension, FFN width, attention heads) determines the total count, but for estimation purposes, the headline number is all you need.
Gradients. During backprop, every parameter gets a gradient tensor of the same shape and precision. So gradients cost the same as model parameters: params * bytes_per_param.
Optimizer States. This is where people get surprised. Covered in detail below.
Activations. The most complex and hardest to predict. Depends on batch size, sequence length, hidden dimension, and number of layers. Also covered below.
Overhead. CUDA context alone eats several hundred MB. Framework internals (PyTorch autograd graph, communication buffers, temporary tensors) add more. Rule of thumb: budget 10-20% on top of everything else.
Precision Formats and Why They Matter
The bytes-per-parameter choice cascades through almost everything:
| Format | Bytes/Param | Typical Use |
|---|---|---|
| float32 | 4 | Legacy training, some optimizer internals |
| float16 / bfloat16 | 2 | Standard mixed-precision training |
| int8 | 1 | Quantized inference, QLoRA |
| int4 | 0.5 | Aggressive quantization |
Moving from fp32 to bf16 halves the memory for parameters, gradients, and activations. That’s a massive win, and it’s why mixed-precision training is now the default for anything above a few billion parameters.
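That halving is easy to sanity-check in a few lines. A minimal sketch (the helper name and the 1 GB = 1e9 bytes convention are mine, matching the 7B * 2 bytes = 14 GB arithmetic used throughout this post):

```python
# Bytes per parameter for each format in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(param_billions, fmt):
    """Weight memory in GB, using 1 GB = 1e9 bytes as in the text."""
    return param_billions * BYTES_PER_PARAM[fmt]

print(weights_gb(7, "fp32"))  # 28.0
print(weights_gb(7, "bf16"))  # 14.0 -- the halving described above
print(weights_gb(7, "int4"))  # 3.5
```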
But there’s a catch: not everything can move to lower precision. Optimizer states are the big exception.
The Optimizer State Trap
Adam and AdamW maintain two additional tensors per parameter: the first moment (momentum) and the second moment (variance). Each is stored in fp32, giving you 4 + 4 = 8 bytes per parameter just for optimizer state.
These stay in fp32 regardless of your model precision, and in standard mixed-precision recipes that's non-negotiable: stored in fp16, the small update magnitudes get rounded to zero and training diverges. Every serious training framework keeps them in full precision by default (quantized-optimizer variants like 8-bit Adam exist, and come up later in the OOM playbook).
Here’s what this means concretely for a 7B model:
| Component | Bytes/Param | Total (7B) |
|---|---|---|
| Params (bf16) | 2 | 14 GB |
| Gradients (bf16) | 2 | 14 GB |
| Adam states (fp32) | 8 | 56 GB |
The optimizer states alone are 4x the model size in bf16. For a 7B model, Adam needs 56 GB just for its moment estimates. This is often the single largest memory consumer, and it’s the main reason you can’t just “train in fp16” and expect everything to fit.
For comparison, SGD with momentum only needs 4 bytes/param (one momentum buffer), and vanilla SGD with no momentum needs zero extra bytes. But nobody trains LLMs with vanilla SGD, and for good reason.
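The per-optimizer bookkeeping above can be sketched as follows. The function name and byte counts are illustrative, assuming bf16 params/grads and fp32 moments as described:

```python
# Training-state memory under the three optimizer choices discussed above.
def training_state_gb(param_billions, optimizer="adamw"):
    opt_bytes = {"adamw": 8, "sgd_momentum": 4, "sgd": 0}[optimizer]
    params_gb = param_billions * 2       # bf16 weights
    grads_gb = param_billions * 2        # bf16 gradients
    opt_gb = param_billions * opt_bytes  # fp32 moment buffers
    return params_gb, grads_gb, opt_gb

print(training_state_gb(7, "adamw"))  # (14, 14, 56): moments are 4x the bf16 weights
print(training_state_gb(7, "sgd"))    # (14, 14, 0)
```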
Activation Memory – The Hard Part
Activations are the intermediate tensors saved during the forward pass for use during backprop. Unlike parameters and optimizer states, activation memory scales with your runtime configuration – batch size and sequence length in particular.
A rough approximation:
Activation_Memory ≈ k * batch_size * seq_len * hidden_size * num_layers * bytes_per_activation
The constant k is empirical and architecture-dependent. For standard transformers it typically falls in the range of 2-10+, depending on how many intermediate tensors each layer saves. Attention layers are particularly expensive because they materialize the full attention matrix (batch * heads * seq_len * seq_len) unless you’re using FlashAttention or a similar memory-efficient variant.
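The approximation translates directly to code. Treat everything here as a sketch: k=4 is an arbitrary pick from the empirical range, and the layer shape (hidden=4096, 32 layers) is an illustrative 7B-ish placeholder, not any specific model's config:

```python
def activation_gb(batch, seq_len, hidden, layers, k=4, bytes_per_act=2):
    # Direct implementation of the rough formula above; k is empirical (2-10+).
    return k * batch * seq_len * hidden * layers * bytes_per_act / 1e9

print(activation_gb(1, 4096, 4096, 32))  # ~4.3 GB for one sequence
print(activation_gb(4, 4096, 4096, 32))  # scales linearly with batch size
```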
This is the least precise part of any memory estimate. The exact number depends on:
- Whether you use activation checkpointing (recomputation)
- Which layers you checkpoint and how many
- Whether attention is memory-efficient (FlashAttention, etc.)
- Framework-specific implementation details
Activation checkpointing trades compute for memory: instead of saving all intermediates, you discard them and recompute during backward. Checkpointing every layer can cut activation memory by 50-70%, at the cost of roughly 30% more compute. Most frameworks let you specify exactly which layers to checkpoint, so you can tune the trade-off.
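To make the trade concrete, here's a back-of-envelope helper. The 60% savings and 30% compute overhead defaults are just points inside the ranges quoted above, not measured values:

```python
def checkpoint_trade(act_gb, mem_savings=0.6, compute_overhead=0.3):
    # Checkpointing discards intermediates and recomputes them in backward:
    # less stored activation memory, more FLOPs.
    return {
        "stored_activations_gb": round(act_gb * (1 - mem_savings), 2),
        "relative_compute": 1 + compute_overhead,
    }

print(checkpoint_trade(10.0))  # {'stored_activations_gb': 4.0, 'relative_compute': 1.3}
```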
How Parallelism Strategies Distribute Memory
Once your model doesn’t fit on one GPU (or you need larger batch sizes), you need to distribute the work. Each parallelism strategy has different memory implications.
Data Parallelism (DP)
Every GPU holds a full copy of the model, optimizer states, and gradients. Each GPU processes a different data shard. Gradients are all-reduced after each step.
Memory impact: essentially none. Each GPU still needs the full model. DP reduces wall-clock time by processing more data in parallel, but it doesn’t help per-GPU memory. This is why pure DP tops out quickly for large models.
Tensor Parallelism (TP)
Weight matrices are split across GPUs within each layer. Each GPU holds a shard of the parameters, gradients, and optimizer states.
Memory_per_GPU ≈ (Params + Gradients + Optimizer_States) / TP_size + partial_activations
TP is very effective at reducing memory, but it requires fast interconnect (NVLink or equivalent) because every forward and backward pass involves all-reduce communication within each layer.
Pipeline Parallelism (PP)
The model is split by layer groups across GPUs. GPU 0 gets layers 0-7, GPU 1 gets layers 8-15, and so on. Each GPU only stores its own layers’ parameters, optimizer states, gradients, and activations.
The trade-off is pipeline bubbles – GPUs sitting idle while waiting for activations from the previous stage. Micro-batching helps fill the pipeline, but some bubble overhead is unavoidable.
Expert Parallelism (EP)
Specific to Mixture-of-Experts models. Experts are distributed across GPUs, so each GPU only stores a subset of the total experts. The routing layers and shared (non-expert) layers are typically replicated. Requires all-to-all communication for token dispatch.
Combined (3D Parallelism)
In practice, large-scale training combines TP, PP, and DP together. The ideal memory per GPU for parameters, gradients, and optimizer states:
Memory_per_GPU ≈ Total / (TP * PP)
DP doesn’t reduce per-GPU model memory. EP applies separately to expert layers. Communication buffers and activation overlap mean reality is always somewhat worse than the ideal formula suggests.
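The ideal-sharding math is one line of arithmetic. A sketch, with DP deliberately absent from the denominator (real frameworks keep some replicated buffers, so treat this as a floor):

```python
def per_gpu_state_gb(param_billions, tp=1, pp=1, param_bytes=2, opt_bytes=8):
    # Params + grads (param_bytes each) + fp32 Adam moments, ideally
    # sharded across TP * PP. DP replicates this state, so it doesn't help.
    total = param_billions * (2 * param_bytes + opt_bytes)
    return total / (tp * pp)

print(per_gpu_state_gb(7))              # 84.0 -- no chance on one 80 GB card
print(per_gpu_state_gb(7, tp=4))        # 21.0 per GPU, before activations/overhead
print(per_gpu_state_gb(7, tp=4, pp=2))  # 10.5 -- 84 GB spread over 8 model shards
```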
Worked Example: Qwen-2.5-7B on H100
Let’s put real numbers on this. I’ll walk through a memory breakdown for training Qwen-2.5-7B in two configurations: an 8-card H100 setup that barely fits, and a 4-card optimized setup that runs comfortably.
Model specs: 7B parameters, 28 transformer layers, bf16 training, AdamW optimizer.
8-Card Configuration (Original)
TP=4, seq_len=8192, rollout-batch-size=32, n-samples-per-prompt=8:
| Component | Per-GPU Memory | Notes |
|---|---|---|
| Model parameters | 14.0 GB | bf16, 7B * 2 bytes |
| Gradients | 5.6 GB | bf16, partially sharded |
| Optimizer states | 22.4 GB | Adam fp32, partially sharded |
| KV cache | 24.5 GB | seq=8192, 28 layers |
| Activations | 9.8 GB | micro-batch=1, recompute 1 layer |
| Framework overhead | 3.7 GB | SGLang + Ray + PyTorch runtime |
| Total | ~80 GB | Right at the H100 80 GB limit |
This works, but it’s skating on the edge. Any spike in activation memory or a slightly longer sequence pushes you into OOM territory. Not a configuration you want to leave running overnight.
4-Card Optimized Configuration
To fit on fewer GPUs with real safety margin, I made four targeted changes:
```
--rollout-batch-size 16          # was 32 (halved)
--n-samples-per-prompt 4         # was 8 (halved)
--rollout-max-response-len 4096  # was 8192 (halved)
--recompute-num-layers 2         # was 1 (trade compute for memory)
```
Result: ~52 GB per GPU, a 35% reduction with comfortable headroom.
| Component | Per-GPU Memory |
|---|---|
| Model parameters | 14.0 GB |
| Gradients | 3.5 GB |
| Optimizer states | 14.0 GB |
| KV cache | ~12.0 GB |
| Activations | ~7.5 GB |
| Framework overhead | 3.7 GB |
| Total | ~52 GB |
What Gave the Biggest Win
The KV cache reduction was the single largest contributor. KV cache memory follows this formula:
KV_mem = 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element
Halving the sequence length from 8192 to 4096 cuts KV cache by 50%. Since KV cache was the largest single component at ~24.5 GB in the original config, this alone saved over 12 GB per GPU. Everything in the formula is multiplicative, so cutting multiple factors compounds quickly.
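Plugging numbers into the formula is straightforward. Note two assumptions of mine here: hidden_size=3584 is Qwen-2.5-7B's published hidden dimension, and with grouped-query attention the real cache is smaller than this full-hidden-size bound, so treat the output as a ceiling:

```python
def kv_cache_gb(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    # 2x for K and V, per the formula above; bf16 elements by default.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem / 1e9

per_seq_8k = kv_cache_gb(28, 3584, 8192, 1)
per_seq_4k = kv_cache_gb(28, 3584, 4096, 1)
print(round(per_seq_8k, 2), round(per_seq_4k, 2))  # 3.29 1.64 (GB per sequence)
print(per_seq_4k / per_seq_8k)  # 0.5 -- halving seq_len halves the cache
```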
The Trade-offs
Nothing is free. Here’s what the optimized config gives up:
| Metric | 8-Card Original | 4-Card Optimized |
|---|---|---|
| Peak memory / GPU | ~80 GB | ~52 GB |
| Throughput (relative) | 100% | 65-70% |
| Max sequence length | 8192 | 4096 |
| Training stability | Critical (near limit) | Safe (28 GB headroom) |
You lose 30-35% throughput and halve your max sequence length. Whether that’s acceptable depends on your task. For many fine-tuning and RLHF jobs, 4096 tokens is plenty.
Practical Tips and OOM Emergency Playbook
When you’re hitting OOM during training, here’s what to try, ordered by impact and minimal disruption to training dynamics:
1. Cut sequence length. Reduce max-response-len or max-seq-len. Going from 8192 to 3072 can free 10+ GB. Both KV cache and attention activations scale linearly (or worse) with sequence length, making this the single biggest lever you can pull.
2. Reduce micro-batch size, increase gradient accumulation. Same effective batch size, much less activation memory per step. The trade-off is more sequential steps per parameter update, which increases wall-clock time.
3. Reduce samples per prompt. If you’re doing RL-style training (GRPO, PPO), cutting n-samples-per-prompt directly reduces the number of rollouts held in memory simultaneously.
4. Increase activation checkpointing. Recompute more layers during backward. This is a pure compute-for-memory trade: more FLOPs, less stored activations. Most frameworks let you specify exactly how many layers to checkpoint.
5. Switch optimizer. If you’re truly desperate, 8-bit Adam (from bitsandbytes) or Adafactor can cut optimizer state memory significantly. But test carefully – optimizer changes can affect convergence, and you don’t want to discover that 200 hours into a run.
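Tip 2 is worth spelling out, since micro-batch and effective batch are easy to conflate. A sketch of the bookkeeping (the function name is mine):

```python
def accumulation_schedule(effective_batch, micro_batch):
    # Activation memory scales with micro_batch; gradient accumulation keeps
    # the effective batch (and thus training dynamics) unchanged.
    assert effective_batch % micro_batch == 0, "micro_batch must divide effective batch"
    return effective_batch // micro_batch

# Dropping micro-batch 32 -> 4 cuts per-step activation memory ~8x,
# at the cost of 8 sequential forward/backward passes per optimizer step.
print(accumulation_schedule(32, 4))  # 8
```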
Quick Estimation Shortcut
For a back-of-envelope check before you even start writing configs:
```python
def estimate_training_memory_gb(param_billions, precision_bytes=2, optimizer="adam"):
    params_gb = param_billions * precision_bytes
    grads_gb = param_billions * precision_bytes
    opt_bytes = {"sgd": 4, "adam": 8}[optimizer]
    opt_gb = param_billions * opt_bytes
    # Activations: rough lower bound at 1.5x param size
    act_gb = params_gb * 1.5
    subtotal = params_gb + grads_gb + opt_gb + act_gb
    return subtotal * 1.15  # 15% overhead buffer

# 7B model, bf16, Adam: 14 + 14 + 56 + 21 = 105 GB, plus 15% overhead
print(f"{estimate_training_memory_gb(7):.1f} GB")  # ~120.7 GB
```
This won’t be precise for any specific configuration, but it tells you whether you’re in the right ballpark before you spend an hour writing launch scripts.
The bottom line: GPU memory estimation comes down to tracking all the components and knowing which ones dominate. Optimizer states are usually the surprise. Activations are the wildcard. And when things don’t fit, cut sequence length first – it’s almost always the highest-impact, lowest-disruption fix.