Copy-paste ready commands for deploying and benchmarking Qwen3.5, Kimi-K2.5, and GLM-5 on AMD Instinct MI355X GPUs.
Disable NUMA balancing for optimal GPU performance.
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
# Verify
cat /proc/sys/kernel/numa_balancing # Should output 0
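To persist the setting across reboots (assuming a host with sysctl.d support; the filename below is our choice, not a requirement), drop it into a sysctl config file and apply with `sudo sysctl --system`:

```
# /etc/sysctl.d/99-numa-balancing.conf
kernel.numa_balancing = 0
```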
The host path /mnt/dcgpuval/huggingface is bind-mounted to /sgl-workspace/models inside the containers.
Two base images are available. The ROCm 7.2 image (20260317) is recommended for Kimi-K2.5 with optimized branches.
For Qwen3.5, use the ROCm 7.0 image (20260310) with the fixed Dockerfile.
# ROCm 7.2 image (recommended for Kimi-K2.5 optimized config)
docker pull rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317
# ROCm 7.0 image (for Qwen3.5 / GLM-5)
docker pull rocm/sgl-dev:v0.5.9-rocm700-mi35x-20260310
# Clone the cookbook repo (contains Dockerfile + helper scripts)
git clone https://github.com/jhinpan/sglang-cookbook.git && cd sglang-cookbook
# Build fixed image: disables broken aiter, patches quark imports
docker build -t sglang-test:v0.5.9-rocm700-mi35x-20260310 \
--build-arg BASE=rocm/sgl-dev:v0.5.9-rocm700-mi35x-20260310 \
-f Dockerfile.bisect .
The base image sets SGLANG_ROCM_FUSED_DECODE_MLA=1, which crashes the triton attention backend with a ForwardMetadata unpacking error.
Override to 0 via -e SGLANG_ROCM_FUSED_DECODE_MLA=0 at docker run time (shown in model configs below).
Select a model below to see its launch configuration for both TP=4 and TP=8.
Mixture-of-Experts — 397B total params, 17B active per token
Tested end-to-end on 8× MI355X. Hybrid DeltaNet architecture: 45 recurrent + 15 GQA layers. --max-mamba-cache-size 128 is critical (+45% perf).
docker run -d \
--name qwen35-serve \
--ipc=host --network=host --privileged \
--shm-size 32G \
--ulimit core=0:0 \
--cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
-e HF_HUB_OFFLINE=1 \
-e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
-v /mnt/dcgpuval/huggingface:/sgl-workspace/models \
sglang-test:v0.5.9-rocm700-mi35x-20260310 \
python3 -m sglang.launch_server \
--model-path /sgl-workspace/models/hub/models--Qwen--Qwen3.5-397B-A17B/snapshots/98d1a504ba52e88924b3a3a008447cf2fdbd518c \
--served-model-name Qwen3.5-397B-A17B \
--tp 8 \
--trust-remote-code \
--attention-backend triton \
--mem-fraction-static 0.80 \
--max-mamba-cache-size 128 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
Key flags:
- --max-mamba-cache-size 128: critical for DeltaNet's 45 recurrent layers (+45% perf vs the default of 64)
- --tool-call-parser qwen3_coder: robust tool calling (10/10 on multi-turn tests)
- --reasoning-parser qwen3: separates <think> blocks
- CUDA graph: enabled (default)
- SGLANG_ROCM_FUSED_DECODE_MLA=0: required for the triton backend
- --ulimit core=0:0: prevents GPU core dumps (~200 GB each) from filling the disk on a crash
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 512 | 0.11 s | 58.90 tok/s | 8.69 s |
| 1,024 | 1,024 | 0.11 s | 59.17 tok/s | 17.31 s |
| 8,192 | 512 | 0.31 s | 55.71 tok/s | 9.19 s |
| 8,192 | 1,024 | 0.25 s | 56.68 tok/s | 18.07 s |
| 16,384 | 512 | 0.51 s | 52.32 tok/s | 9.79 s |
| 16,384 | 1,024 | 0.45 s | 53.79 tok/s | 19.04 s |
Image: sglang-test:v0.5.9-rocm700-mi35x-20260310 • 8× MI355X • triton backend • CUDA graph on • mamba-cache=128 • March 2026
Qwen3.5’s 45 DeltaNet recurrent layers can leak memory over extended runtime, causing NCCL process group failures after ~10–12 hours.
See Issue #20010 / PR #20182.
Workaround: periodic container restart (docker restart qwen35-serve).
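Until the fix lands, the restart can be automated with a cron entry (the 8-hour cadence is an assumption sized to stay inside the ~10–12 hour leak window; note that in-flight requests are dropped at each restart):

```
# Add via `crontab -e` as a user with docker access
0 */8 * * * docker restart qwen35-serve
```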
# Wait for "The server is fired up and ready to roll!"
docker logs -f qwen35-serve
# Health check
curl http://localhost:30000/health
# Inference test (use max_tokens >= 512 for reasoning models)
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 512,
"temperature": 0.0
}' | python3 -m json.tool
# Tool calling test
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "Read /etc/hostname"}],
"tools": [{"type":"function","function":{"name":"ReadFile","description":"Read a file.","parameters":{"type":"object","properties":{"path":{"type":"string"}},"required":["path"]}}}],
"tool_choice": "auto",
"max_tokens": 1024
}' | python3 -m json.tool
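The same tool-calling request can be driven from Python using only the standard library (a minimal sketch; the chat() helper is our own wrapper, not part of SGLang):

```python
import json
import urllib.request

def chat(payload, url="http://localhost:30000/v1/chat/completions"):
    """POST an OpenAI-style chat completion and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Mirrors the tool-calling curl test above.
payload = {
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Read /etc/hostname"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "ReadFile",
            "description": "Read a file.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
    "tool_choice": "auto",
    "max_tokens": 1024,
}

# Requires the server launched above:
# result = chat(payload)
# print(result["choices"][0]["message"].get("tool_calls"))
```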
The LoRA SFT v4 adapter fine-tunes Qwen3.5 on AMD GPU kernel engineering trajectories (270 examples, rank 32, 13 target module types). SGLang's runtime LoRA does not support this adapter's DeltaNet recurrent modules (in_proj_a/b/z/qkv, out_proj) or its MoE gate (shared_expert_gate). The workaround is to merge the adapter offline and serve the merged model as a standalone checkpoint.
SGLang LoRA supports q/k/v/o_proj and gate/up/down_proj; this adapter additionally targets 6 unsupported modules across Qwen3.5's 45 DeltaNet layers and MoE gates.
Attempting --lora-paths will fail at init_lora_shapes(); see sglang#9897.
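The mismatch can be checked before attempting a launch by diffing the adapter's target modules against SGLang's supported set (module names below are taken from this section; in a real check you would read them from the adapter's adapter_config.json — the check itself is a plain set operation):

```python
# Modules SGLang's runtime LoRA can patch (per the note above).
SGLANG_SUPPORTED = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}

# The SFT v4 adapter's 13 target module types, as listed in this section.
ADAPTER_TARGETS = SGLANG_SUPPORTED | {
    "in_proj_a", "in_proj_b", "in_proj_z", "in_proj_qkv",
    "out_proj", "shared_expert_gate",
}

unsupported = sorted(ADAPTER_TARGETS - SGLANG_SUPPORTED)
print(unsupported)
# These 6 modules are what trips init_lora_shapes() -> merge offline instead.
```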
Use LLaMA-Factory to merge. The merge requires ~800 GB of host RAM and produces a 743 GB checkpoint in 5 GB shards.
### model
model_name_or_path: /sgl-workspace/models/hub/models--Qwen--Qwen3.5-397B-A17B/snapshots/7cad2bae11cb49ca79f7d6a0954de2e2756f4e27
adapter_name_or_path: JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4
template: qwen3_5_nothink
trust_remote_code: true
### export
export_dir: /sgl-workspace/models/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged
export_size: 5
export_device: cpu
export_legacy_format: false
llamafactory-cli export examples/merge_lora/qwen35_397b_lora_sft_amdpilot.yaml
Identical to the base Qwen3.5 launch command, just swap --model-path and --served-model-name.
All flags (--max-mamba-cache-size, --reasoning-parser, etc.) carry over.
docker run -d \
--name qwen35-sft-serve \
--ipc=host --network=host --privileged \
--shm-size 32G \
--ulimit core=0:0 \
--cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
-e HF_HUB_OFFLINE=1 \
-e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
-v /mnt/dcgpuval/huggingface:/sgl-workspace/models \
sglang-test:v0.5.9-rocm700-mi35x-20260310 \
python3 -m sglang.launch_server \
--model-path /sgl-workspace/models/hub/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
--served-model-name Qwen3.5-397B-A17B-SFT-v4 \
--tp 8 \
--trust-remote-code \
--attention-backend triton \
--mem-fraction-static 0.80 \
--max-mamba-cache-size 128 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
# Wait for server ready
docker logs -f qwen35-sft-serve
# Health check
curl http://localhost:30000/health
# Test inference with the fine-tuned model
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B-SFT-v4",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 512,
"temperature": 0.0
}' | python3 -m json.tool
Mixture-of-Experts — 1T total params, 32B active per token, W4A16 quantized
Achieves 23.5ms decode median (42.6 tok/s) vs 38.3ms baseline — a 38.6% improvement. Uses the stock ROCm 7.2 image with optimized sglang and aiter branches checked out inside the container. Three key optimizations: GEMM A16W16 small-M configs (BLOCK_SIZE_M=16–64 for M=1 decode), MoE Triton kernel configs (E=384, N=128 for MI355X), and hybrid attention (triton decode + aiter prefill).
docker pull rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317
docker run -d --name kimi-k25-server \
--device /dev/kfd --device /dev/dri \
--shm-size 64g --network host \
--cap-add SYS_PTRACE --group-add video \
-e GPU_COREDUMP_ENABLE=0 \
-e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
-e SGLANG_USE_AITER=1 \
-v /path/to/huggingface:/root/.cache/huggingface \
rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317 \
sleep infinity
Replace /path/to/huggingface with the parent directory containing the hub/ folder with models--moonshotai--Kimi-K2.5.
docker exec kimi-k25-server bash -c '
cd /sgl-workspace/sglang
git stash && git clean -fd
git remote add fork https://github.com/Arist12/sglang.git
git fetch fork kimi-k25-optimize-v2
git checkout fork/kimi-k25-optimize-v2
cd python && pip install -e .
'
docker exec kimi-k25-server bash -c '
cd /sgl-workspace/aiter
git stash && git clean -fd
git remote add fork https://github.com/Arist12/aiter.git
git fetch fork kimi-k25-optimize-v2
git checkout fork/kimi-k25-optimize-v2
pip install .
'
Must use pip install . (non-editable) for aiter.
Using pip install -e . creates a broken namespace package that fails to resolve compiled C extensions.
Verify with: python3 -c "from aiter import dynamic_per_tensor_quant; print('OK')"
docker exec -d kimi-k25-server bash -c '
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export SGLANG_USE_AITER=1
/opt/venv/bin/python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 \
--trust-remote-code \
--decode-attention-backend triton \
--prefill-attention-backend aiter \
--mem-fraction-static 0.85 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 --port 30000 \
> /tmp/server.log 2>&1
'
Key flags:
- --decode-attention-backend triton + --prefill-attention-backend aiter: hybrid attention (triton for decode, aiter ASM kernels for prefill)
- SGLANG_ROCM_FUSED_DECODE_MLA=0: required (the image default is 1)
- SGLANG_USE_AITER=1: enables the aiter prefill path
- --mem-fraction-static 0.85: maximizes KV cache
- --reasoning-parser kimi_k2: separates reasoning from content
- --tool-call-parser kimi_k2: enables structured tool calling
- CUDA graph: enabled (default)
What the optimized branches change:
- sglang (Arist12/sglang:kimi-k25-optimize-v2): MoE Triton kernel configs for E=384, N=128 on MI355X, with BLOCK_SIZE_M=16 for batch=1 decode.
- aiter (Arist12/aiter:kimi-k25-optimize-v2): GEMM A16W16 small-M configs (M_LEQ_4/8/16/32/64 with BLOCK_SIZE_M=16–64; the default was 256).
bench_one_batch (raw decode latency, input=8192, output=2048)
| Configuration | Decode Median | Prefill | Improvement |
|---|---|---|---|
| Baseline (triton attn, default configs) | 38.3 ms | — | — |
| + AITER prefill attention | 34.4 ms | — | 10.2% |
| + GEMM A16W16 small-M tuning | 24.3 ms | — | 36.6% |
| + MoE Triton config tuning (final) | 23.5 ms (42.6 tok/s) | 637 ms (12,847 tok/s) | 38.6% |
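The Improvement column is the reduction in decode median relative to baseline, and tok/s follows from batch size 1 (one token per decode step), so both headline numbers can be sanity-checked with two lines of arithmetic:

```python
baseline_ms = 38.3  # decode median, stock config (table above)
final_ms = 23.5     # decode median, all optimizations applied

improvement = (baseline_ms - final_ms) / baseline_ms
tok_per_s = 1000 / final_ms  # batch=1: one token per decode step

print(f"{improvement:.1%}")      # 38.6%
print(f"{tok_per_s:.1f} tok/s")  # 42.6 tok/s
```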
bench_one_batch_server (server throughput, output=2048)
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 2,048 | 0.27 s | 45.19 tok/s | 45.32 s |
| 2,048 | 2,048 | 0.33 s | 44.67 tok/s | 45.84 s |
Image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317 + optimized branches • 8× MI355X • triton decode + aiter prefill • CUDA graph on • March 2026
18 configs tested across 2 ROCm images with stock sglang/aiter. These results are superseded by the optimized config above.
rocm700 (v0.5.9-rocm700-mi35x-20260319)
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 512 | 3.66 s | 33.62 tok/s | 18.86 s |
| 1,024 | 1,024 | 1.54 s | 35.54 tok/s | 16.57 s |
| 8,192 | 512 | 5.58 s | 23.36 tok/s | 27.41 s |
| 8,192 | 1,024 | 5.57 s | 23.31 tok/s | 43.93 s |
rocm720 (v0.5.9-rocm720-mi35x-20260320)
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 512 | 1.40 s | 26.47 tok/s | 20.70 s |
| 1,024 | 1,024 | 0.84 s | 26.34 tok/s | 39.67 s |
| 8,192 | 512 | 4.17 s | 19.48 tok/s | 30.40 s |
| 8,192 | 1,024 | 4.20 s | 19.37 tok/s | 21.95 s |
1.8× decode speedup with lightseekorg/kimi-k2.5-eagle3 (3B, 1 layer, BF16). Supports greedy and non-greedy (temp>0) on ROCm.
# After applying the optimized branches (Steps 2a/2b above), launch with Eagle3:
docker exec -d kimi-k25-server bash -c '
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export SGLANG_USE_AITER=1
/opt/venv/bin/python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 --trust-remote-code \
--decode-attention-backend triton \
--prefill-attention-backend aiter \
--mem-fraction-static 0.75 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 --port 30000 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
> /tmp/server.log 2>&1
'
Must use --speculative-algorithm EAGLE3 (not EAGLE).
Using EAGLE silently degrades accept_length to 1.0 because it skips the 3-layer aux hidden state capture that Eagle3 requires.
See PR #21272 for the auto-detection fix.
Note --mem-fraction-static 0.75 (reduced from 0.85) to leave VRAM for the draft model.
Eagle3 Decode Throughput (TP=8, BS=1, math/coding tasks)
| Task | Greedy (temp=0) | Non-greedy (temp=0.6) | vs Baseline |
|---|---|---|---|
| math_algebra | 69.8 tok/s | 66.2 tok/s | 1.9× |
| math_series | 69.3 tok/s | 71.7 tok/s | 2.0× |
| coding_fibonacci | 66.6 tok/s | 57.7 tok/s | 1.6× |
| coding_sort | 66.3 tok/s | 56.0 tok/s | 1.6× |
| coding_json | 58.6 tok/s | 55.8 tok/s | 1.6× |
| AVERAGE | 66.1 tok/s | 61.5 tok/s | 1.8× / 1.7× |
Accept length: greedy 2.3–3.6, non-greedy 3.0–3.4 • Baseline: 35.8 tok/s (rocm700 stock) • --mem-fraction-static 0.75 required for Eagle3
Non-greedy on ROCm: Requires PyTorch fallback kernels for 3 missing sgl_kernel C++ ops.
Apply via patch_eagle_rocm.py from this repo, or use PR #21275 once merged.
Without the patch, temp>0 falls back to greedy silently.
Eagle3 accept_length degrades at 8K+ input tokens (falls to 1.6–2.0), making it slower than baseline. This is a draft model quality issue, not a serving bug. See #6783. Use Eagle3 for short-context, high-throughput workloads (coding, math, chat).
# Monitor startup logs (wait for "The server is fired up and ready to roll!")
docker exec kimi-k25-server tail -f /tmp/server.log
# Health check
curl http://localhost:30000/health
# Chat completion (note reasoning_content field from --reasoning-parser kimi_k2)
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/Kimi-K2.5",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}' | python3 -m json.tool
# Tool calling test (note tool_calls array from --tool-call-parser kimi_k2)
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/Kimi-K2.5",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{"type":"function","function":{"name":"get_weather","description":"Get weather for a location","parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
"max_tokens": 200
}' | python3 -m json.tool
MoE + Native Sparse Attention (NSA) — 744B total params, 40B active, DeepSeek-V2 architecture
GLM-5 uses a glm_moe_dsa model type that HuggingFace Transformers doesn't recognize natively. It is registered in SGLang's config loader (included in recent SGLang builds); ensure your SGLang version includes the fix from PR #18911.
# Ensure latest transformers (for GLM-5 tokenizer support)
pip install --upgrade transformers
python3 -m sglang.launch_server \
--model-path zai-org/GLM-5-FP8 \
--served-model-name glm-5-fp8 \
--tp 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.80 \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--watchdog-timeout 1200 \
--port 30000
Tight fit: a 705 GB model in 1,152 GB of total HBM. Reduce --mem-fraction-static to leave room for KV cache. TP=8 is strongly recommended for GLM-5.
HIP_VISIBLE_DEVICES=0,1,2,3 python3 -m sglang.launch_server \
--model-path zai-org/GLM-5-FP8 \
--served-model-name glm-5-fp8 \
--tp 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.60 \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--disable-cuda-graph \
--watchdog-timeout 1200 \
--port 30000
# Health check
curl http://localhost:30000/health
# List models
curl http://localhost:30000/v1/models | python3 -m json.tool
# Quick inference test
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5-fp8",
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"max_tokens": 64
}' | python3 -m json.tool
Use bench_one_batch_server to measure single-batch latency including HTTP and scheduler overhead. It connects to a running server.
| Input Tokens | Output Tokens | Batch Size | Measures |
|---|---|---|---|
| 1,024 | 512 | 1 | Prefill + decode latency |
| 1,024 | 1,024 | 1 | Short-context long-generation |
| 8,192 | 512 | 1 | Medium-context prefill stress |
| 8,192 | 1,024 | 1 | Medium-context full pipeline |
| 16,384 | 512 | 1 | Long-context prefill stress |
| 16,384 | 1,024 | 1 | Long-context full pipeline |
Iterates through all input/output length combinations. Assumes server is running on port 30000.
for INPUT_LEN in 1024 8192 16384; do
for OUTPUT_LEN in 512 1024; do
echo "====== Input: ${INPUT_LEN}, Output: ${OUTPUT_LEN} ======"
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len $INPUT_LEN \
--output-len $OUTPUT_LEN
echo ""
done
done
Pick and run a specific combination.
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 1024 \
--output-len 512
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 1024 \
--output-len 1024
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 8192 \
--output-len 512
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 8192 \
--output-len 1024
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 16384 \
--output-len 512
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 16384 \
--output-len 1024
For more realistic multi-client throughput testing, use bench_serving instead.
CON="16 32 64 128"
ISL=3200
OSL=800
for con in $CON; do
PROMPTS=$(($con * 5))
python3 -m sglang.bench_serving \
--dataset-name random \
--random-input-len $ISL \
--random-output-len $OSL \
--num-prompts $PROMPTS \
--random-range-ratio 1.0 \
--max-concurrency $con
done