- mlsys
- gpu
- serving
- distributed-training
- rl
- engineering
- **NeMo-RL vs Slime — A Systems-Level Comparison**: Deep comparison of two RL training frameworks: feature matrices, algorithmic differences, MoE readiness, and when to use which.
- **Adding Sequence Parallelism to Slime's FSDP Backend**: Design doc for integrating Ring-Attention-based sequence parallelism into slime's FSDP training backend: architecture, tradeoffs, and RL coupling.
- **A Practical Guide to LLM GPU Memory Estimation**: How to estimate GPU memory for LLM training: precision formats, optimizer states, parallelism strategies, and a worked example with Qwen-2.5-7B on H100.
- **SGLang as GRPO Inference Backend in TRL**: Integrating SGLang into HuggingFace TRL for GRPO training: server-based rollout, distributed-init fixes, and memory optimization.