RL training for language models has a rollout problem. Every GRPO training step needs to generate completions from the current policy, score them, and then update weights based on the group-relative advantages. The generation part – the rollout – is where things get ugly.

Most frameworks handle rollout with an offline inference engine. You load the model, generate a batch of completions, and move on. This works for single-turn tasks, but it falls apart in two specific ways once you try to scale or add multi-turn interactions:

  1. Concurrency is terrible. Each batch blocks until every sequence finishes. If one prompt in the batch triggers a long chain-of-thought and the rest finish fast, everyone waits. At scale, this wastes a lot of GPU time.
  2. Agent frameworks can’t plug in. Most agent and environment tooling expects to hit an HTTP endpoint with a base URL and API key. An offline engine doesn’t expose that. So you end up writing glue code to fake a server interface, or you rebuild the agent loop from scratch.

Over the past few months, I worked on two projects that attack both problems: a server-based rollout architecture on top of veRL Engine, and a direct integration of SGLang into HuggingFace TRL as the GRPO inference backend.

Server-Based Multi-Turn Rollout via veRL Engine

The idea came out of discussions around veRL issue #385. I worked with Haoran, Chengxing, and Chenyang on the design and implementation.

The core insight is straightforward: if you already need a server for multi-turn rollout (because the Actor has to repeatedly interact with an Environment), then just make the server the primary rollout mechanism. Don’t bolt it on as an afterthought.

Why a Server

Three properties matter:

Native concurrency. A server handles concurrent requests out of the box. Multiple rollout workers can hit the same endpoint without synchronizing. No batch-level blocking.

Cross-process Actor-Environment communication. The Actor exposes a port. The Environment connects to it. During multi-turn rollout, they go back and forth – the Actor generates, the Environment responds, the Actor generates again. It's all just HTTP calls. No shared memory hacks, no IPC pipes.

Independent Environment lifecycle. The Environment process doesn’t need to live inside the training loop. It starts, it connects, it runs. If you want to swap the Environment (different reward model, different tool set), you don’t touch the training code.
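The back-and-forth between Actor and Environment can be sketched as a transport-agnostic loop. Here `generate` and `env_step` are hypothetical stand-ins: in the server architecture, `generate` would wrap an HTTP POST to the Actor's endpoint and `env_step` would be the Environment's reply.

```python
def multi_turn_rollout(generate, env_step, prompt, max_turns=8):
    """One trajectory: Actor generates, Environment responds, repeat.

    generate: str -> str, in practice an HTTP call to the Actor server.
    env_step: str -> (observation, done), the Environment's response.
    Both interfaces are illustrative, not veRL's actual API.
    """
    trajectory = [("prompt", prompt)]
    obs = prompt
    for _ in range(max_turns):
        action = generate(obs)
        trajectory.append(("actor", action))
        obs, done = env_step(action)
        trajectory.append(("env", obs))
        if done:
            break
    return trajectory
```

Because the loop only sees two callables, it can be exercised with stubs and no running server, which is exactly the decoupling the independent-lifecycle property buys you.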

The Lifecycle

Here’s how a single training step works:

Training Init
  └─ Start all servers
  └─ Offload model params to CPU

Rollout Phase
  └─ Sync Actor weights → Server
  └─ Offload Actor (free GPU memory)
  └─ Server loads model params
  └─ Server spins up Environment
  └─ Enter rollout loop:
       Actor generates → Env responds → Actor generates → ...
  └─ Collect trajectories
  └─ Async batching across rollout workers

Post-Rollout
  └─ Offload server params → CPU
  └─ Resume training on collected trajectories

The key detail is the memory dance. At any given moment, either the training process or the server process owns the GPU. They trade off by staging params through CPU. It’s not elegant, but it means you don’t need 2x the GPU memory.
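A minimal sketch of that handoff, assuming a plain PyTorch model (the real veRL implementation does this per worker, and the helper names here are made up):

```python
import torch

def release_gpu(model):
    # Training side: stage parameters to CPU so the rollout
    # server can claim the GPU for inference.
    for p in model.parameters():
        p.data = p.data.to("cpu")
    torch.cuda.empty_cache()  # hand freed blocks back to the allocator

def reclaim_gpu(model, device="cuda"):
    # Post-rollout: bring parameters back for the training step.
    for p in model.parameters():
        p.data = p.data.to(device)
```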

The async batching is also worth calling out. During rollout, different sequences finish at different times. Instead of blocking until the whole batch completes, the server can feed finished trajectories to a buffer and start new ones. This is where most of the throughput gain comes from compared to offline engines.
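A toy asyncio sketch of the idea – simulated generation times standing in for real requests to the rollout server:

```python
import asyncio
import random

async def generate_one(prompt):
    # Stand-in for a request to the rollout server; different
    # sequences finish at different times.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return f"{prompt} -> done"

async def rollout_async(prompts):
    buffer = []
    tasks = [asyncio.create_task(generate_one(p)) for p in prompts]
    # Consume trajectories as they complete instead of waiting
    # for the slowest sequence in the batch.
    for fut in asyncio.as_completed(tasks):
        buffer.append(await fut)
    return buffer
```

In the real system the buffer would feed the training loop while new rollouts start; the point is that no finished trajectory ever waits on an unfinished one.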

Integrating SGLang into TRL’s GRPO Pipeline

The second project was more concrete: get SGLang working as a first-class inference backend in HuggingFace’s TRL framework for GRPO training. This was part of the broader Open-R1 effort. I worked with Qiujiang and Chenyang, and the result is PR #2981.

TRL already had a vLLM backend path. The job was to build an equivalent path for SGLang. Four main pieces:

1. Get SGLang Server running through the full GRPO loop

The GRPO trainer in TRL has a specific sequence: init, generate completions, compute rewards, compute advantages, update policy. The SGLang server needs to be alive and responsive throughout, and it needs to handle weight updates between training steps.
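For reference, the "compute advantages" step uses the standard GRPO normalization – each completion is scored relative to the other completions sampled for the same prompt (this is the textbook formula, not TRL's exact code):

```python
def group_relative_advantages(rewards, eps=1e-4):
    # GRPO advantage: (r - group mean) / (group std + eps),
    # computed over the completions for a single prompt.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```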

2. Initialize the offline engine and process group

At startup, we need to bring up an SGLang engine instance and set up the distributed process group so the engine can communicate with the training processes. The init looks roughly like this:

import torch.distributed as dist
import sglang as sgl

# Manually initialize distributed backend
# (bypassing HF Accelerate -- more on this below)
if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        world_size=world_size,
        rank=rank,
    )

# Launch SGLang engine
engine = sgl.Engine(
    model_path=model_path,
    tp_size=tp_size,
    mem_fraction_static=mem_fraction,
)

3. Implement weight synchronization

The _update_sglang_engine_weights function is the analog of TRL’s existing _move_model_to_vllm. After each training step, the policy weights have changed. We need to push those updates into the running SGLang engine without restarting it.

import torch

def _update_sglang_engine_weights(self):
    """Push updated policy weights to the SGLang engine."""
    for name, param in self.model.named_parameters():
        # Cast the weight to the dtype the engine runs in
        weight_data = param.data.to(self.engine_dtype)
        # SGLang takes a list of (name, tensor) pairs; streaming
        # one pair at a time keeps peak memory low
        self.engine.update_weights_from_tensor([(name, weight_data)])
    # Free intermediate memory
    torch.cuda.empty_cache()

4. Handle generation in _prepare_inputs()

The _prepare_inputs method packages prompts and sends them to the SGLang engine for completion. The responses come back as token IDs and get stitched back into the training batch format that TRL’s GRPO loss function expects.
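The stitching itself is mostly padding bookkeeping. A hypothetical helper (not TRL's actual code) showing the shape of it:

```python
def stitch_batch(prompt_ids, completion_ids, pad_id=0):
    """Combine per-sequence prompt and completion token IDs into
    equal-width rows plus a completion mask -- the kind of layout a
    GRPO loss consumes. Illustrative only."""
    rows, masks = [], []
    width = max(len(p) + len(c) for p, c in zip(prompt_ids, completion_ids))
    for p, c in zip(prompt_ids, completion_ids):
        ids = p + c
        pad = width - len(ids)
        rows.append(ids + [pad_id] * pad)
        # Loss is computed on completion tokens only.
        masks.append([0] * len(p) + [1] * len(c) + [0] * pad)
    return rows, masks
```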

The Hard Parts

The integration looked simple on paper. Two issues ate most of the debugging time.

Distributed Engine Init Failure

When running in a distributed training setup (multiple GPUs, HuggingFace Accelerate managing the processes), SGLang's engine initialization would fail silently – no error message, just a hang during NCCL initialization.

The root cause: HuggingFace Accelerate initializes its own distributed process group early in the training setup. When SGLang then tries to initialize its own NCCL communicator, the two collide. They’re both trying to manage the same GPU resources and the same network ports.

The fix was blunt: skip Accelerate’s distributed init entirely and do it manually before anything else runs.

import os
import socket

import torch.distributed as dist
import sglang as sgl

def find_free_port():
    # Grab an ephemeral port for the rendezvous. With multiple
    # processes, rank 0 should pick it and share it (e.g. via an
    # env var) so every rank dials the same address.
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Don't let Accelerate touch distributed init
os.environ["ACCELERATE_BYPASS_DEVICE_MAP"] = "true"

# Manual distributed setup
dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{master_addr}:{find_free_port()}",
    world_size=world_size,
    rank=rank,
)

# Now SGLang can init without conflicts
engine = sgl.Engine(model_path=model_path, tp_size=tp_size)

This works, but it’s a workaround. HuggingFace was apparently discussing whether to make Accelerate more flexible about process group ownership. I don’t know where that landed.

Memory Blowup During Inference

The second problem was memory. During the rollout phase, GPU memory usage would spike well beyond what the model weights alone should require. On an 8-GPU setup, we were running out of memory on configurations that should have had plenty of headroom.

The issue was in the weight update path. There are three ways to update weights in SGLang:

Function                          Source               Memory Profile
-------------------------------   ------------------   -------------------------------
update_weights_from_disk          Checkpoint on disk   Low GPU, high I/O
update_weights_from_tensor        In-memory tensors    Peaks at 2x model size
update_weights_from_distributed   Cross-process NCCL   Low per-GPU, needs coordination

We were using update_weights_from_tensor naively, which meant the engine held both the old weights and the new weights in GPU memory simultaneously during the update. For a 7B model in bf16 that's 14GB of weights, so ~28GB on a single GPU just for the weight swap.

The fix was a staged approach:

  1. Before the update, offload the training model’s optimizer states to CPU.
  2. Update weights parameter-by-parameter (not all at once), freeing each old tensor before allocating the next new one.
  3. After the update completes, call torch.cuda.empty_cache() to return fragmented memory to the allocator.
  4. For larger models, fall back to update_weights_from_distributed which shards the transfer across GPUs.

import torch

def _update_sglang_engine_weights(self):
    """Memory-efficient weight sync to SGLang engine."""
    # Stage 1: free training memory
    self._offload_optimizer_states()

    # Stage 2: stream weights parameter by parameter
    for name, param in self.model.named_parameters():
        weight = param.data.to(self.engine_dtype)
        if self.use_distributed_update:
            # Large models: the engine receives the tensor over NCCL
            self.engine.update_weights_from_distributed(
                name, weight, src_rank=0
            )
        else:
            # SGLang takes a list of (name, tensor) pairs
            self.engine.update_weights_from_tensor([(name, weight)])
        del weight  # free the cast copy before the next parameter

    # Stage 3: reclaim fragmented memory
    torch.cuda.empty_cache()

This brought peak memory usage down to roughly 1.1x model size during updates, which was enough to run comfortably on our target hardware.

What I Learned

Process group conflicts are the #1 pain point for multi-framework integration. If you’re combining any two systems that both want to manage NCCL (training framework + serving engine, or training framework + custom communication layer), expect to spend time debugging silent hangs. Manual init is ugly but reliable.

Memory optimization in weight updates is about staging, not algorithms. The three update functions aren’t better or worse than each other in absolute terms. What matters is when you call them relative to when you free other memory. The sequencing is the optimization.

Server-based rollout should be the default for RL training. The offline engine approach made sense when RL for LLMs was single-turn and small-scale. Once you add multi-turn interaction or scale past a handful of GPUs, the server architecture wins on both throughput and flexibility. I expect most frameworks will move in this direction.

The PR is at huggingface/trl#2981 if you want to look at the actual code.