Copy-paste ready commands for deploying and benchmarking Qwen3.5, Kimi-K2.5, and GLM-5 on AMD Instinct MI355X GPUs.
Disable NUMA balancing for optimal GPU performance.
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
# Verify
cat /proc/sys/kernel/numa_balancing # Should output 0
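To persist the setting across reboots (assuming a host with sysctl.d support; the filename below is our choice, not a requirement), drop it into a sysctl config file and apply with `sudo sysctl --system`:

```
# /etc/sysctl.d/99-numa-balancing.conf
kernel.numa_balancing = 0
```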
The host path /mnt/dcgpuval/huggingface is bind-mounted to /sgl-workspace/models inside the containers.
Two base images are available. The ROCm 7.2 image (20260317) is recommended for Kimi-K2.5 with optimized branches.
For Qwen3.5, use the ROCm 7.0 image (20260310) with the fixed Dockerfile.
# ROCm 7.2 image (recommended for Kimi-K2.5 optimized config)
docker pull rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317
# ROCm 7.0 image (for Qwen3.5 / GLM-5)
docker pull rocm/sgl-dev:v0.5.9-rocm700-mi35x-20260310
# Clone the cookbook repo (contains Dockerfile + helper scripts)
git clone https://github.com/jhinpan/sglang-cookbook.git && cd sglang-cookbook
# Build fixed image: disables broken aiter, patches quark imports
docker build -t sglang-test:v0.5.9-rocm700-mi35x-20260310 \
--build-arg BASE=rocm/sgl-dev:v0.5.9-rocm700-mi35x-20260310 \
-f Dockerfile.bisect .
The base image sets SGLANG_ROCM_FUSED_DECODE_MLA=1, which crashes the triton attention backend with a ForwardMetadata unpacking error.
Override to 0 via -e SGLANG_ROCM_FUSED_DECODE_MLA=0 at docker run time (shown in model configs below).
Select a model below to see its launch configuration for both TP=4 and TP=8.
Mixture-of-Experts — 397B total params, 17B active per token
Tested end-to-end on 8× MI355X. Hybrid DeltaNet architecture: 45 recurrent + 15 GQA layers. --max-mamba-cache-size 128 is critical (+45% perf).
docker run -d \
--name qwen35-serve \
--ipc=host --network=host --privileged \
--shm-size 32G \
--ulimit core=0:0 \
--cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
-e HF_HUB_OFFLINE=1 \
-e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
-v /mnt/dcgpuval/huggingface:/sgl-workspace/models \
sglang-test:v0.5.9-rocm700-mi35x-20260310 \
python3 -m sglang.launch_server \
--model-path /sgl-workspace/models/hub/models--Qwen--Qwen3.5-397B-A17B/snapshots/98d1a504ba52e88924b3a3a008447cf2fdbd518c \
--served-model-name Qwen3.5-397B-A17B \
--tp 8 \
--trust-remote-code \
--attention-backend triton \
--mem-fraction-static 0.80 \
--max-mamba-cache-size 128 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
Key flags:
- --max-mamba-cache-size 128: critical for DeltaNet's 45 recurrent layers (+45% perf vs the default of 64)
- --tool-call-parser qwen3_coder: robust tool calling (10/10 on multi-turn tests)
- --reasoning-parser qwen3: separates <think> blocks
- CUDA graph: enabled (default)
- SGLANG_ROCM_FUSED_DECODE_MLA=0: required for the triton backend
- --ulimit core=0:0: prevents GPU core dumps (~200 GB each) from filling the disk on a crash
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 512 | 0.11 s | 58.90 tok/s | 8.69 s |
| 1,024 | 1,024 | 0.11 s | 59.17 tok/s | 17.31 s |
| 8,192 | 512 | 0.31 s | 55.71 tok/s | 9.19 s |
| 8,192 | 1,024 | 0.25 s | 56.68 tok/s | 18.07 s |
| 16,384 | 512 | 0.51 s | 52.32 tok/s | 9.79 s |
| 16,384 | 1,024 | 0.45 s | 53.79 tok/s | 19.04 s |
Image: sglang-test:v0.5.9-rocm700-mi35x-20260310 • 8× MI355X • triton backend • CUDA graph on • mamba-cache=128 • March 2026
Qwen3.5’s 45 DeltaNet recurrent layers can leak memory over extended runtime, causing NCCL process group failures after ~10–12 hours.
See Issue #20010 / PR #20182.
Workaround: periodic container restart (docker restart qwen35-serve).
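Until the fix lands, the restart can be automated with a cron entry (the 8-hour cadence is an assumption sized to stay inside the ~10–12 hour leak window; note that in-flight requests are dropped at each restart):

```
# Add via `crontab -e` as a user with docker access
0 */8 * * * docker restart qwen35-serve
```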
# Wait for "The server is fired up and ready to roll!"
docker logs -f qwen35-serve
# Health check
curl http://localhost:30000/health
# Inference test (use max_tokens >= 512 for reasoning models)
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 512,
"temperature": 0.0
}' | python3 -m json.tool
# Tool calling test
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "Read /etc/hostname"}],
"tools": [{"type":"function","function":{"name":"ReadFile","description":"Read a file.","parameters":{"type":"object","properties":{"path":{"type":"string"}},"required":["path"]}}}],
"tool_choice": "auto",
"max_tokens": 1024
}' | python3 -m json.tool
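The same tool-calling request can be driven from Python using only the standard library (a minimal sketch; the chat() helper is our own wrapper, not part of SGLang):

```python
import json
import urllib.request

def chat(payload, url="http://localhost:30000/v1/chat/completions"):
    """POST an OpenAI-style chat completion and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Mirrors the tool-calling curl test above.
payload = {
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Read /etc/hostname"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "ReadFile",
            "description": "Read a file.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
    "tool_choice": "auto",
    "max_tokens": 1024,
}

# Requires the server launched above:
# result = chat(payload)
# print(result["choices"][0]["message"].get("tool_calls"))
```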
The LoRA SFT v4 adapter fine-tunes Qwen3.5 on AMD GPU kernel engineering trajectories (270 examples, rank 32, 13 target module types). SGLang's runtime LoRA does not support this adapter's DeltaNet recurrent modules (in_proj_a/b/z/qkv, out_proj) or its MoE gate (shared_expert_gate). The workaround is to merge the adapter offline and serve the merged model as a standalone checkpoint.
SGLang LoRA supports q/k/v/o_proj and gate/up/down_proj; this adapter additionally targets 6 unsupported modules across Qwen3.5's 45 DeltaNet layers and MoE gates.
Attempting --lora-paths will fail at init_lora_shapes(); see sglang#9897.
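The mismatch can be checked before attempting a launch by diffing the adapter's target modules against SGLang's supported set (module names below are taken from this section; in a real check you would read them from the adapter's adapter_config.json — the check itself is a plain set operation):

```python
# Modules SGLang's runtime LoRA can patch (per the note above).
SGLANG_SUPPORTED = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}

# The SFT v4 adapter's 13 target module types, as listed in this section.
ADAPTER_TARGETS = SGLANG_SUPPORTED | {
    "in_proj_a", "in_proj_b", "in_proj_z", "in_proj_qkv",
    "out_proj", "shared_expert_gate",
}

unsupported = sorted(ADAPTER_TARGETS - SGLANG_SUPPORTED)
print(unsupported)
# These 6 modules are what trips init_lora_shapes() -> merge offline instead.
```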
Use LLaMA-Factory to merge. The merge requires ~800 GB of host RAM and produces a 743 GB checkpoint in 5 GB shards.
### model
model_name_or_path: /sgl-workspace/models/hub/models--Qwen--Qwen3.5-397B-A17B/snapshots/7cad2bae11cb49ca79f7d6a0954de2e2756f4e27
adapter_name_or_path: JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4
template: qwen3_5_nothink
trust_remote_code: true
### export
export_dir: /sgl-workspace/models/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged
export_size: 5
export_device: cpu
export_legacy_format: false
llamafactory-cli export examples/merge_lora/qwen35_397b_lora_sft_amdpilot.yaml
Identical to the base Qwen3.5 launch command, just swap --model-path and --served-model-name.
All flags (--max-mamba-cache-size, --reasoning-parser, etc.) carry over.
docker run -d \
--name qwen35-sft-serve \
--ipc=host --network=host --privileged \
--shm-size 32G \
--ulimit core=0:0 \
--cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
--device=/dev/kfd --device=/dev/dri \
--group-add video \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
-e HF_HUB_OFFLINE=1 \
-e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
-v /mnt/dcgpuval/huggingface:/sgl-workspace/models \
sglang-test:v0.5.9-rocm700-mi35x-20260310 \
python3 -m sglang.launch_server \
--model-path /sgl-workspace/models/hub/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
--served-model-name Qwen3.5-397B-A17B-SFT-v4 \
--tp 8 \
--trust-remote-code \
--attention-backend triton \
--mem-fraction-static 0.80 \
--max-mamba-cache-size 128 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--watchdog-timeout 1200 \
--host 0.0.0.0 \
--port 30000
# Wait for server ready
docker logs -f qwen35-sft-serve
# Health check
curl http://localhost:30000/health
# Test inference with the fine-tuned model
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B-SFT-v4",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"max_tokens": 512,
"temperature": 0.0
}' | python3 -m json.tool
Mixture-of-Experts — 1T total params, 32B active per token, W4A16 quantized
Achieves 23.5ms decode median (42.6 tok/s) vs 38.3ms baseline — a 38.6% improvement. Uses the stock ROCm 7.2 image with optimized sglang and aiter branches checked out inside the container. Three key optimizations: GEMM A16W16 small-M configs (BLOCK_SIZE_M=16–64 for M=1 decode), MoE Triton kernel configs (E=384, N=128 for MI355X), and hybrid attention (triton decode + aiter prefill).
docker pull rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317
docker run -d --name kimi-k25-server \
--device /dev/kfd --device /dev/dri \
--shm-size 64g --network host \
--cap-add SYS_PTRACE --group-add video \
-e GPU_COREDUMP_ENABLE=0 \
-e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
-e SGLANG_USE_AITER=1 \
-v /path/to/huggingface:/root/.cache/huggingface \
rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317 \
sleep infinity
Replace /path/to/huggingface with the parent directory containing the hub/ folder with models--moonshotai--Kimi-K2.5.
docker exec kimi-k25-server bash -c '
cd /sgl-workspace/sglang
git stash && git clean -fd
git remote add fork https://github.com/Arist12/sglang.git
git fetch fork kimi-k25-optimize-v2
git checkout fork/kimi-k25-optimize-v2
cd python && pip install -e .
'
docker exec kimi-k25-server bash -c '
cd /sgl-workspace/aiter
git stash && git clean -fd
git remote add fork https://github.com/Arist12/aiter.git
git fetch fork kimi-k25-optimize-v2
git checkout fork/kimi-k25-optimize-v2
pip install .
'
Must use pip install . (non-editable) for aiter.
Using pip install -e . creates a broken namespace package that fails to resolve compiled C extensions.
Verify with: python3 -c "from aiter import dynamic_per_tensor_quant; print('OK')"
docker exec -d kimi-k25-server bash -c '
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export SGLANG_USE_AITER=1
/opt/venv/bin/python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 \
--trust-remote-code \
--decode-attention-backend triton \
--prefill-attention-backend aiter \
--mem-fraction-static 0.85 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 --port 30000 \
> /tmp/server.log 2>&1
'
Key flags:
- --decode-attention-backend triton + --prefill-attention-backend aiter: hybrid attention (triton for decode, aiter ASM kernels for prefill)
- SGLANG_ROCM_FUSED_DECODE_MLA=0: required (the image default is 1)
- SGLANG_USE_AITER=1: enables the aiter prefill path
- --mem-fraction-static 0.85: maximizes KV cache
- --reasoning-parser kimi_k2: separates reasoning from content
- --tool-call-parser kimi_k2: enables structured tool calling
- CUDA graph: enabled (default)
What the optimized branches change:
- sglang (Arist12/sglang:kimi-k25-optimize-v2): MoE Triton kernel configs for E=384, N=128 on MI355X, with BLOCK_SIZE_M=16 for batch=1 decode.
- aiter (Arist12/aiter:kimi-k25-optimize-v2): GEMM A16W16 small-M configs (M_LEQ_4/8/16/32/64 with BLOCK_SIZE_M=16–64; the default was 256).
bench_one_batch (raw decode latency, input=8192, output=2048)
| Configuration | Decode Median | Prefill | Improvement |
|---|---|---|---|
| Baseline (triton attn, default configs) | 38.3 ms | — | — |
| + AITER prefill attention | 34.4 ms | — | 10.2% |
| + GEMM A16W16 small-M tuning | 24.3 ms | — | 36.6% |
| + MoE Triton config tuning (final) | 23.5 ms (42.6 tok/s) | 637 ms (12,847 tok/s) | 38.6% |
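The Improvement column is the reduction in decode median relative to baseline, and tok/s follows from batch size 1 (one token per decode step), so both headline numbers can be sanity-checked with two lines of arithmetic:

```python
baseline_ms = 38.3  # decode median, stock config (table above)
final_ms = 23.5     # decode median, all optimizations applied

improvement = (baseline_ms - final_ms) / baseline_ms
tok_per_s = 1000 / final_ms  # batch=1: one token per decode step

print(f"{improvement:.1%}")      # 38.6%
print(f"{tok_per_s:.1f} tok/s")  # 42.6 tok/s
```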
bench_one_batch_server (server throughput, output=2048)
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 2,048 | 0.27 s | 45.19 tok/s | 45.32 s |
| 2,048 | 2,048 | 0.33 s | 44.67 tok/s | 45.84 s |
Image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317 + optimized branches • 8× MI355X • triton decode + aiter prefill • CUDA graph on • March 2026
18 configs tested across 2 ROCm images with stock sglang/aiter. These results are superseded by the optimized config above.
rocm700 (v0.5.9-rocm700-mi35x-20260319)
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 512 | 3.66 s | 33.62 tok/s | 18.86 s |
| 1,024 | 1,024 | 1.54 s | 35.54 tok/s | 16.57 s |
| 8,192 | 512 | 5.58 s | 23.36 tok/s | 27.41 s |
| 8,192 | 1,024 | 5.57 s | 23.31 tok/s | 43.93 s |
rocm720 (v0.5.9-rocm720-mi35x-20260320)
| Input | Output | TTFT | Decode | Total Latency |
|---|---|---|---|---|
| 1,024 | 512 | 1.40 s | 26.47 tok/s | 20.70 s |
| 1,024 | 1,024 | 0.84 s | 26.34 tok/s | 39.67 s |
| 8,192 | 512 | 4.17 s | 19.48 tok/s | 30.40 s |
| 8,192 | 1,024 | 4.20 s | 19.37 tok/s | 21.95 s |
1.8× decode speedup with lightseekorg/kimi-k2.5-eagle3 (3B, 1 layer, BF16). Supports greedy and non-greedy (temp>0) on ROCm.
# After applying the optimized branches (Steps 2a/2b above), launch with Eagle3:
docker exec -d kimi-k25-server bash -c '
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export SGLANG_USE_AITER=1
/opt/venv/bin/python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2.5 \
--tp 8 --trust-remote-code \
--decode-attention-backend triton \
--prefill-attention-backend aiter \
--mem-fraction-static 0.75 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 --port 30000 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
> /tmp/server.log 2>&1
'
Must use --speculative-algorithm EAGLE3 (not EAGLE).
Using EAGLE silently degrades accept_length to 1.0 because it skips the 3-layer aux hidden state capture that Eagle3 requires.
See PR #21272 for the auto-detection fix.
Note --mem-fraction-static 0.75 (reduced from 0.85) to leave VRAM for the draft model.
Eagle3 Decode Throughput (TP=8, BS=1, math/coding tasks)
| Task | Greedy (temp=0) | Non-greedy (temp=0.6) | vs Baseline |
|---|---|---|---|
| math_algebra | 69.8 tok/s | 66.2 tok/s | 1.9× |
| math_series | 69.3 tok/s | 71.7 tok/s | 2.0× |
| coding_fibonacci | 66.6 tok/s | 57.7 tok/s | 1.6× |
| coding_sort | 66.3 tok/s | 56.0 tok/s | 1.6× |
| coding_json | 58.6 tok/s | 55.8 tok/s | 1.6× |
| AVERAGE | 66.1 tok/s | 61.5 tok/s | 1.8× / 1.7× |
Accept length: greedy 2.3–3.6, non-greedy 3.0–3.4 • Baseline: 35.8 tok/s (rocm700 stock) • --mem-fraction-static 0.75 required for Eagle3
Non-greedy on ROCm: Requires PyTorch fallback kernels for 3 missing sgl_kernel C++ ops.
Apply via patch_eagle_rocm.py from this repo, or use PR #21275 once merged.
Without the patch, temp>0 falls back to greedy silently.
Eagle3 accept_length degrades at 8K+ input tokens (falls to 1.6–2.0), making it slower than baseline. This is a draft model quality issue, not a serving bug. See #6783. Use Eagle3 for short-context, high-throughput workloads (coding, math, chat).
# Monitor startup logs (wait for "The server is fired up and ready to roll!")
docker exec kimi-k25-server tail -f /tmp/server.log
# Health check
curl http://localhost:30000/health
# Chat completion (note reasoning_content field from --reasoning-parser kimi_k2)
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/Kimi-K2.5",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}' | python3 -m json.tool
# Tool calling test (note tool_calls array from --tool-call-parser kimi_k2)
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "moonshotai/Kimi-K2.5",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{"type":"function","function":{"name":"get_weather","description":"Get weather for a location","parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
"max_tokens": 200
}' | python3 -m json.tool
MoE + Native Sparse Attention (NSA) — 744B total params, 40B active, DeepSeek-V2 architecture
GLM-5 uses a glm_moe_dsa model type that HuggingFace Transformers doesn't recognize natively. It is registered in SGLang's config loader (included in recent SGLang builds); ensure your SGLang version includes the fix from PR #18911.
# Ensure latest transformers (for GLM-5 tokenizer support)
pip install --upgrade transformers
python3 -m sglang.launch_server \
--model-path zai-org/GLM-5-FP8 \
--served-model-name glm-5-fp8 \
--tp 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.80 \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--watchdog-timeout 1200 \
--port 30000
Tight fit: a 705 GB model in 1,152 GB of total HBM. Reduce --mem-fraction-static to leave room for KV cache. TP=8 is strongly recommended for GLM-5.
HIP_VISIBLE_DEVICES=0,1,2,3 python3 -m sglang.launch_server \
--model-path zai-org/GLM-5-FP8 \
--served-model-name glm-5-fp8 \
--tp 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.60 \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang \
--chunked-prefill-size 131072 \
--disable-cuda-graph \
--watchdog-timeout 1200 \
--port 30000
# Health check
curl http://localhost:30000/health
# List models
curl http://localhost:30000/v1/models | python3 -m json.tool
# Quick inference test
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5-fp8",
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"max_tokens": 64
}' | python3 -m json.tool
Use bench_one_batch_server to measure single-batch latency including HTTP and scheduler overhead. It connects to a running server.
| Input Tokens | Output Tokens | Batch Size | Measures |
|---|---|---|---|
| 1,024 | 512 | 1 | Prefill + decode latency |
| 1,024 | 1,024 | 1 | Short-context long-generation |
| 8,192 | 512 | 1 | Medium-context prefill stress |
| 8,192 | 1,024 | 1 | Medium-context full pipeline |
| 16,384 | 512 | 1 | Long-context prefill stress |
| 16,384 | 1,024 | 1 | Long-context full pipeline |
Iterates through all input/output length combinations. Assumes server is running on port 30000.
for INPUT_LEN in 1024 8192 16384; do
for OUTPUT_LEN in 512 1024; do
echo "====== Input: ${INPUT_LEN}, Output: ${OUTPUT_LEN} ======"
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len $INPUT_LEN \
--output-len $OUTPUT_LEN
echo ""
done
done
Pick and run a specific combination.
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 1024 \
--output-len 512
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 1024 \
--output-len 1024
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 8192 \
--output-len 512
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 8192 \
--output-len 1024
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 16384 \
--output-len 512
python3 -m sglang.bench_one_batch_server \
--model None \
--base-url http://localhost:30000 \
--batch-size 1 \
--input-len 16384 \
--output-len 1024
For more realistic multi-client throughput testing, use bench_serving instead.
CON="16 32 64 128"
ISL=3200
OSL=800
for con in $CON; do
PROMPTS=$(($con * 5))
python3 -m sglang.bench_serving \
--dataset-name random \
--random-input-len $ISL \
--random-output-len $OSL \
--num-prompts $PROMPTS \
--random-range-ratio 1.0 \
--max-concurrency $con
done