# NeMo-RL vs Slime — A Systems-Level Comparison
If you’re doing RLHF or GRPO-style training on large language models today, two frameworks sit at interesting ends of the spectrum: NVIDIA’s NeMo-RL and THUDM’s slime. Both handle rollout + training loops for policy optimization. Both support GRPO and importance sampling. But they make very different bets on what matters.
I spent time evaluating both, digging into their algorithmic implementations, testing infrastructure, and hardware support stories. This post is the result — a systems-level comparison that should help you decide which one fits your use case.
## Who Should Use Which
The short version:
- Pick NeMo-RL if you need production-grade RL training with full DAPO support, strong CI guarantees, and you’re running on NVIDIA hardware. It’s enterprise-ready in ways that matter: deterministic builds, 32k+ lines of tests, type checking, one-command reproducibility.
- Pick slime if you’re doing research on SGLang-native rollout, need ROCm/MI300X support, or are training MoE models where routing determinism matters. It powers GLM-4.5 and GLM-4.6 in production, so it’s battle-tested at scale — just in a different way.
The longer version requires looking at 18+ features across six dimensions.
## Feature Matrix
This is the centerpiece of the comparison. I tracked every feature I could find across both codebases:
| Feature | slime | NeMo-RL |
|---|---|---|
| GRPO | Yes | Yes |
| DAPO | Partial (dual-clip only) | Full |
| GSPO (seq-level IS) | Yes | Yes |
| DPO | No | Yes |
| MoE Routing Replay | Yes | No |
| Token-level IS / TIS | Yes (+custom TIS functions) | Yes (truncated IS) |
| Training Backends | Megatron | DTensor + Megatron |
| Sequence Packing | Balanced packing | 4 algorithms (MFFD/FFD/FFS/Concat) |
| ROCm Support | Yes (MI300X/MI325) | No |
| Rollout Backend | SGLang (primary) | vLLM (sync + async) |
| One-command Run | No | Yes (uv run) |
| Dependency Lock | No | Yes (uv.lock) |
| CI / Test Coverage | ~565 LOC | 32,661 LOC |
| Deterministic Rollout | Yes | No |
| Multi-Token Prediction Training | Yes | No |
| Fault Tolerance | Basic | Production-grade |
| Type Checking | No | Yes (pyrefly) |
| Docker Files | 3 | N/A (uv-based) |
A few things jump out immediately. NeMo-RL has 58x more test code than slime. That’s not a typo — 32,661 lines vs 565. On the other hand, slime has deterministic rollout and MoE routing replay, which NeMo-RL simply doesn’t offer. These aren’t minor features if you’re training mixture-of-experts models.
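One row in the matrix worth unpacking is sequence packing. First-fit-decreasing (FFD), one of the four packers listed for NeMo-RL, is easy to sketch; this is the generic textbook algorithm, not NeMo-RL's actual implementation:

```python
def ffd_pack(lengths, capacity):
    """First-fit-decreasing bin packing for sequence lengths.

    Sort sequences longest-first, place each into the first bin with
    room, and open a new bin when none fits. Generic sketch only.
    """
    bins = []
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= capacity:
                b.append(length)
                break
        else:
            # No existing bin had room: start a new one.
            bins.append([length])
    return bins

# Example: pack token counts into 512-token micro-batches.
packed = ffd_pack([300, 200, 400, 100, 150], capacity=512)
```

The payoff is fewer padding tokens per micro-batch, which is why packing strategy shows up as a throughput lever in both frameworks.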
## Algorithmic Design Differences
Both frameworks implement GRPO-family algorithms, but the internals diverge in ways that affect training dynamics.
### Importance Sampling Granularity
slime computes a sequence-level KL divergence and broadcasts it uniformly across every token in the sequence. NeMo-RL takes the more direct approach: it computes a sequence-level importance ratio and uses it directly in the loss. The practical difference is subtle but real: slime's broadcast can smooth out token-level noise, while NeMo-RL's direct ratio gives you tighter per-sequence correction.
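To make the contrast concrete, here is a minimal pure-Python sketch of the two granularities. The numbers are illustrative and the KL term is a simple Monte-Carlo estimate; this is not either framework's actual code:

```python
import math

# Hypothetical per-token log-probs for one rollout sequence.
logp_policy  = [-1.2, -0.8, -2.1, -0.5]   # current policy
logp_rollout = [-1.0, -0.9, -2.0, -0.6]   # policy that generated the rollout

# slime-style: estimate a per-sequence KL (mean log-ratio here),
# then broadcast the same value to every token in the sequence.
seq_kl = sum(p - r for p, r in zip(logp_policy, logp_rollout)) / len(logp_policy)
per_token_kl = [seq_kl] * len(logp_policy)

# NeMo-RL-style: one sequence-level importance ratio, used directly in the loss.
seq_ratio = math.exp(sum(logp_policy) - sum(logp_rollout))
```

In the broadcast version every token sees the same correction; in the direct-ratio version the whole sequence is reweighted by a single multiplicative factor.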
### Clipping Strategies
slime uses PPO-style clipping with an additional --eps-clip-high flag for Clip-Higher (the DAPO-style asymmetric clipping). NeMo-RL exposes explicit ratio_clip_min and ratio_clip_max parameters, which aligns directly with the DAPO paper’s formulation.
This matters because DAPO’s clip ranges are asymmetric by design — you want to clip the ratio differently above and below 1.0. NeMo-RL’s explicit min/max parameterization maps cleanly to this. slime gets you partway there with the dual-clip approach, but the full DAPO algorithm (including overlong penalty and dynamic sampling) is only complete in NeMo-RL.
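An asymmetric clip is compact enough to sketch directly. The parameter names clip_min and clip_high below are hypothetical, chosen to mirror the Clip-Higher idea rather than either framework's exact flags:

```python
def dapo_clipped_objective(ratio, advantage, clip_min=0.2, clip_high=0.28):
    """PPO-style clipped objective with an asymmetric clip range.

    The ceiling (1 + clip_high) is wider than the floor (1 - clip_min),
    in the spirit of DAPO's Clip-Higher. Default values are illustrative.
    """
    clipped_ratio = max(1.0 - clip_min, min(ratio, 1.0 + clip_high))
    # Pessimistic min over the unclipped and clipped surrogate terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a ratio of 1.5 and a positive advantage, the objective is capped at the 1.28 ceiling; with a ratio of 0.5 and a negative advantage, the 0.8 floor binds instead. The asymmetry lets low-probability tokens grow more before clipping kicks in, which is the Clip-Higher motivation.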
### Off-Policy Correction
slime implements Token-level Importance Sampling (TIS) with support for custom correction functions. You can plug in your own TIS function, which is useful for research. NeMo-RL provides both token-level and sequence-level IS with truncation — more standard, less flexible.
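A sketch of both flavors in one function, assuming a pluggable correction hook in the spirit of slime's custom TIS functions (the names and the default cap are hypothetical, not either framework's API):

```python
import math

def tis_weight(logp_policy, logp_rollout, correction=None, cap=2.0):
    """Token-level importance weight with an optional custom correction.

    Without a hook, fall back to truncated IS: clip the ratio from above
    to bound variance, as in NeMo-RL's truncated-IS approach.
    """
    ratio = math.exp(logp_policy - logp_rollout)
    if correction is not None:
        return correction(ratio)
    return min(ratio, cap)
```

A researcher could pass, say, `correction=lambda r: r ** 0.5` to damp rather than truncate large ratios; the default path reproduces plain truncation.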
### Length Control
NeMo-RL includes DAPO’s overlong penalty out of the box: sequences that exceed the target length get a shaped negative reward. slime has no built-in length shaping. If length control matters for your task (and it usually does for code generation or math reasoning), NeMo-RL saves you from implementing this yourself.
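The shape of DAPO-style length shaping is simple: no penalty below a buffer zone, a linear ramp inside it, full penalty beyond the hard cap. This sketch uses illustrative max_len and buffer values, not NeMo-RL's defaults:

```python
def overlong_penalty(length, max_len=1024, buffer=256):
    """Soft overlong penalty in the spirit of DAPO's length shaping.

    Returns 0 below (max_len - buffer), ramps linearly to -1 at
    max_len, and stays at -1 past the hard cap. Values illustrative.
    """
    if length <= max_len - buffer:
        return 0.0
    if length > max_len:
        return -1.0
    return (max_len - buffer - length) / buffer
```

For example, a 896-token response under these settings lands halfway into the buffer and receives a -0.5 shaping term, added to whatever task reward the sequence earned.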
## MoE Readiness and Routing Replay
This is where slime has a genuine architectural advantage.
When you’re training a mixture-of-experts model like DeepSeek-V3 or Qwen3-MoE, the expert routing decisions during rollout need to be reproducible during the training forward pass. If routing differs between rollout and training, the gradients are computed against a different computation graph than the one that generated the outputs. This is a correctness problem, not just a performance one.
slime solves this with routing replay: it captures the expert routing decisions during SGLang rollout and replays them during the Megatron training forward pass. Combined with --advantage-estimator gspo, this gives you deterministic expert assignments across the rollout-training boundary.
NeMo-RL supports sequence_level_importance_ratios for off-policy correction, which helps with the staleness problem, but it doesn’t replay routing decisions. If you’re training large MoE models and care about gradient correctness, this gap is significant.
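The replay idea itself is easy to sketch, even though slime's real implementation spans the SGLang/Megatron boundary. This toy top-k router is not slime's API; it only illustrates capture-and-replay:

```python
def route_topk(logits, k=2, replay=None):
    """Pick top-k experts per token, or replay a recorded decision.

    If a replay trace is supplied, return it verbatim so the training
    forward pass sees exactly the routing the rollout used.
    """
    if replay is not None:
        return replay
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return order[:k]

# Rollout: record the routing decision per token.
trace = route_topk([0.1, 0.9, 0.3, 0.5])
# Training: replay it, even though the recomputed logits may differ slightly.
assert route_topk([0.2, 0.4, 0.9, 0.1], replay=trace) == trace
```

Without replay, the recomputed router logits in the training pass can flip a top-k decision, and the gradient flows through different experts than the ones that produced the rollout tokens.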
## ROCm vs CUDA
slime runs on both CUDA and ROCm. Specifically, it supports MI300X and MI325 GPUs with AITER flash-attention kernels, and ships three Docker files for different hardware configurations. I’ve run slime on MI300X hardware and it works — this isn’t vaporware.
| Feature | slime | NeMo-RL |
|---|---|---|
| CUDA | Yes | Yes |
| ROCm | Yes (MI300X/MI325) | No |
| Flash-Attn | AITER (ROCm) + standard (CUDA) | flash-attn 2.8.1 (CUDA only) |
| ROCm Docker | Yes (3 Dockerfiles) | No |
NeMo-RL is CUDA-only. No ROCm support, no AMD GPU path. If you’re in an environment with AMD hardware (and that’s becoming more common as MI300X deployments grow), slime is currently your only option between these two. This is a hard constraint, not a preference.
## Developer Experience and Reproducibility
NeMo-RL wins this category convincingly.
Reproducibility: NeMo-RL uses uv for dependency management with a locked uv.lock file. You run uv run and get a deterministic environment. slime relies on Docker for reproducibility, which works but is heavier and less composable.
```bash
# NeMo-RL: one command
uv run python examples/run_grpo_math.py

# slime: Docker-based
docker pull slimerl/slime:latest
```
Testing: NeMo-RL has 18 CI workflows and 32,661 lines of test code. slime has 4 CI workflows and ~565 lines. The gap is enormous. NeMo-RL’s test suite includes multi-tier tests (L0/L1/L2), integration tests, unit tests, and regression tests across multiple configurations. slime’s testing is mostly end-to-end.
| Metric | slime | NeMo-RL |
|---|---|---|
| Test LOC | ~565 | 32,661 |
| CI Workflows | 4 | 18 |
| Test Tiers | E2E only | L0/L1/L2 |
| Dep Lock | No | uv.lock |
| Type Checking | No | pyrefly |
Type safety: NeMo-RL uses pyrefly for type checking. slime doesn’t have type checking. For a training framework where a shape mismatch can silently produce wrong gradients, type safety is worth the overhead.
Documentation: NeMo-RL has comprehensive docs, examples, and a clear getting-started path. slime’s documentation is thinner — you’ll be reading source code more often.
I want to be fair here: slime is a research framework maintained by a smaller team, and it prioritizes shipping features over developer experience polish. That’s a reasonable tradeoff for a research project. But if you’re onboarding a team or need to debug a training run at 3am, NeMo-RL’s infrastructure investment pays off. And slime counters with something NeMo-RL can’t claim: GLM-4.5 and GLM-4.6 were trained with it, so it has real production mileage.
## Migration Paths
If you’re considering switching between the two, here’s what each direction looks like.
slime to NeMo-RL: The main risk is losing MoE routing replay and ROCm support. You gain DAPO completeness, better testing, and one-command reproducibility. On the practical side: prefer DTensor (HF format) for weights, map --eps-clip/kl-coef to NeMo-RL’s YAML ratio_clip_* and reference_policy_kl_penalty parameters, and swap SGLang rollout for vLLM. The rollout backend switch is nontrivial if your serving infrastructure is SGLang-native.
NeMo-RL to slime: You gain SGLang rollout, ROCm support, and MoE determinism. You lose DAPO’s full algorithm (overlong penalty, dynamic sampling), the extensive test suite, DPO support, and dependency locking. You’ll need to convert HF weights to Megatron torch_dist format and translate YAML configs to bash-style configuration. The training backend stays on Megatron, but you lose the DTensor option.
Neither migration is trivial. The algorithmic differences in IS computation and clipping mean you can’t just swap backends and expect identical training dynamics.
## Verdict
These frameworks aren’t competing for the same niche, despite the surface-level overlap.
slime is the right choice when you need SGLang-native rollout, AMD GPU support, or MoE training with routing determinism. It’s research-optimized and battle-tested on large models (GLM-4.5/4.6). The codebase is lean and moves fast, but the testing and DX story is thin.
NeMo-RL is the right choice when you need production-grade training infrastructure on NVIDIA hardware. Full DAPO support, 58x more test coverage, deterministic builds, type checking — it’s what you pick when reliability at scale matters more than research agility.
The ideal scenario is to steal ideas from both. slime’s routing replay belongs in every MoE training framework. NeMo-RL’s testing discipline and DAPO completeness should be the baseline for any RL training system. Neither framework has the complete picture yet, but between the two of them, most of the important pieces exist somewhere.