MI355X · ROCm 7.2 · SGLang v0.5.9

SGLang AMD
Cookbook

Copy-paste-ready commands for deploying and benchmarking Qwen3.5, Kimi-K2.5, and GLM-5 on AMD Instinct MI355X GPUs.

Hardware: 8× MI355X (288GB HBM3E each)
Memory BW: 8 TB/s per GPU
Updated: March 2026

Prerequisites

System Preparation

Disable NUMA balancing for optimal GPU performance.

bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
# Verify
cat /proc/sys/kernel/numa_balancing  # Should output 0
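The echo above does not survive a reboot. One way to make it persistent is a sysctl drop-in (a sketch; the drop-in file name below is an arbitrary choice):

```shell
# Persist NUMA-balancing off across reboots via a sysctl.d drop-in.
echo 'kernel.numa_balancing=0' | sudo tee /etc/sysctl.d/99-numa-balancing.conf
sudo sysctl --system   # reload; should report kernel.numa_balancing = 0
```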

Model Weight Paths

Host path /mnt/dcgpuval/huggingface → bind-mounted to /sgl-workspace/models inside containers.

Qwen3.5: models--Qwen--Qwen3.5-397B-A17B/ (752 GB)
Kimi-K2.5: models--moonshotai--Kimi-K2.5/ (555 GB)
GLM-5: models--zai-org--GLM-5-FP8/ (705 GB)
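Before starting containers, it is worth confirming that the host path actually holds these snapshots. A sketch (`HF_ROOT` and the `hub/` layout follow the paths above; adjust if yours differ):

```shell
# Check that each expected model directory exists and report its size.
HF_ROOT=${HF_ROOT:-/mnt/dcgpuval/huggingface}

check_model() {
    # Print the on-disk size of one model dir under $HF_ROOT/hub, or MISSING.
    if [ -d "$HF_ROOT/hub/$1" ]; then
        du -sh "$HF_ROOT/hub/$1"
    else
        echo "MISSING: $1"
    fi
}

check_model models--Qwen--Qwen3.5-397B-A17B    # expect ~752 GB
check_model models--moonshotai--Kimi-K2.5      # expect ~555 GB
check_model models--zai-org--GLM-5-FP8         # expect ~705 GB
```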

Docker Images

Two base images are available. The ROCm 7.2 image (20260317) is recommended for the optimized Kimi-K2.5 config. For Qwen3.5 and GLM-5, use the ROCm 7.0 image (20260310); for Qwen3.5, additionally build the fixed Dockerfile below.

bash — pull base images
# ROCm 7.2 image (recommended for Kimi-K2.5 optimized config)
docker pull rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317

# ROCm 7.0 image (for Qwen3.5 / GLM-5)
docker pull rocm/sgl-dev:v0.5.9-rocm700-mi35x-20260310
bash — build fixed image with aiter stub
# Clone the cookbook repo (contains Dockerfile + helper scripts)
git clone https://github.com/jhinpan/sglang-cookbook.git && cd sglang-cookbook

# Build fixed image: disables broken aiter, patches quark imports
docker build -t sglang-test:v0.5.9-rocm700-mi35x-20260310 \
    --build-arg BASE=rocm/sgl-dev:v0.5.9-rocm700-mi35x-20260310 \
    -f Dockerfile.bisect .
Required: Override SGLANG_ROCM_FUSED_DECODE_MLA

The base image sets SGLANG_ROCM_FUSED_DECODE_MLA=1, which crashes the triton attention backend with a ForwardMetadata unpacking error. Override to 0 via -e SGLANG_ROCM_FUSED_DECODE_MLA=0 at docker run time (shown in model configs below).

Model Deployment

Select a model below to see its launch configuration for both TP=4 and TP=8.

Qwen3.5-397B-A17B

Mixture-of-Experts — 397B total params, 17B active per token

MoE BF16 752 GB
HF Path
Qwen/Qwen3.5-397B-A17B
Decode Throughput
53–59 tok/s (TP8, BS1)
Attention Backend
triton
Status
Verified ✓

TP=8 Launch Server (Verified)

Tested end-to-end on 8× MI355X. Hybrid DeltaNet architecture: 45 recurrent + 15 GQA layers. --max-mamba-cache-size 128 is critical (+45% perf).

bash — docker run
docker run -d \
    --name qwen35-serve \
    --ipc=host --network=host --privileged \
    --shm-size 32G \
    --ulimit core=0:0 \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    -e HF_HUB_OFFLINE=1 \
    -e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
    -v /mnt/dcgpuval/huggingface:/sgl-workspace/models \
    sglang-test:v0.5.9-rocm700-mi35x-20260310 \
    python3 -m sglang.launch_server \
        --model-path /sgl-workspace/models/hub/models--Qwen--Qwen3.5-397B-A17B/snapshots/98d1a504ba52e88924b3a3a008447cf2fdbd518c \
        --served-model-name Qwen3.5-397B-A17B \
        --tp 8 \
        --trust-remote-code \
        --attention-backend triton \
        --mem-fraction-static 0.80 \
        --max-mamba-cache-size 128 \
        --reasoning-parser qwen3 \
        --tool-call-parser qwen3_coder \
        --watchdog-timeout 1200 \
        --host 0.0.0.0 \
        --port 30000

Key flags:
--max-mamba-cache-size 128: critical for DeltaNet’s 45 recurrent layers (+45% perf vs the default of 64)
--tool-call-parser qwen3_coder: robust tool calling (10/10 multi-turn)
--reasoning-parser qwen3: separates <think> blocks
CUDA graph: enabled by default
SGLANG_ROCM_FUSED_DECODE_MLA=0: required for the triton backend
--ulimit core=0:0: prevents GPU core dumps (~200 GB each) from filling the disk on a crash

BENCH bench_one_batch_server Results (TP=8, batch=1)

Input     Output    TTFT      Decode          Total Latency
1,024     512       0.11 s    58.90 tok/s     8.69 s
1,024     1,024     0.11 s    59.17 tok/s     17.31 s
8,192     512       0.31 s    55.71 tok/s     9.19 s
8,192     1,024     0.25 s    56.68 tok/s     18.07 s
16,384    512       0.51 s    52.32 tok/s     9.79 s
16,384    1,024     0.45 s    53.79 tok/s     19.04 s

Image: sglang-test:v0.5.9-rocm700-mi35x-20260310 • 8× MI355X • triton backend • CUDA graph on • mamba-cache=128 • March 2026
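In this table the decode figure appears to equal output tokens divided by total latency (our reading of the numbers, not an official definition of the benchmark's output). A quick awk cross-check of the first row:

```shell
# Cross-check row 1 (1,024 in / 512 out): 512 output tokens over 8.69 s total.
awk 'BEGIN { printf "decode: %.2f tok/s (reported: 58.90)\n", 512 / 8.69 }'
```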

Known Issue: DeltaNet Mamba Memory Leak

Qwen3.5’s 45 DeltaNet recurrent layers can leak memory over extended runtime, causing NCCL process group failures after ~10–12 hours. See Issue #20010 / PR #20182. Workaround: periodic container restart (docker restart qwen35-serve).
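Until the leak is fixed upstream, the restart can be automated. A minimal sketch (the 8-hour interval is an assumption; anything comfortably under the ~10–12 hour window works):

```shell
# Periodically restart the serving container to clear the leaked memory.
RESTART_INTERVAL_S=$((8 * 3600))   # 8 h, under the ~10-12 h failure window

restart_loop() {
    # Restart the qwen35-serve container every $1 seconds, forever.
    while true; do
        sleep "$1"
        docker restart qwen35-serve
    done
}

# Run in the background on the host:
# restart_loop "$RESTART_INTERVAL_S" &
```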

Verification

bash
# Wait for "The server is fired up and ready to roll!"
docker logs -f qwen35-serve

# Health check
curl http://localhost:30000/health

# Inference test (use max_tokens >= 512 for reasoning models)
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3.5-397B-A17B",
      "messages": [{"role": "user", "content": "What is 2+2?"}],
      "max_tokens": 512,
      "temperature": 0.0
    }' | python3 -m json.tool

# Tool calling test
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3.5-397B-A17B",
      "messages": [{"role": "user", "content": "Read /etc/hostname"}],
      "tools": [{"type":"function","function":{"name":"ReadFile","description":"Read a file.","parameters":{"type":"object","properties":{"path":{"type":"string"}},"required":["path"]}}}],
      "tool_choice": "auto",
      "max_tokens": 1024
    }' | python3 -m json.tool

LoRA SFT: Serving the Fine-tuned Model (v4)

The LoRA SFT v4 adapter fine-tunes Qwen3.5 on AMD GPU kernel engineering trajectories (270 examples, rank 32, 13 target module types). SGLang’s runtime LoRA does not support this adapter’s DeltaNet recurrent modules (in_proj_a/b/z/qkv, out_proj) or MoE gate (shared_expert_gate). The workaround is to merge the adapter offline and serve the merged model as a standalone checkpoint.

Why Runtime LoRA Won’t Work

SGLang LoRA supports: q/k/v/o_proj, gate/up/down_proj. This adapter also targets 6 unsupported modules across Qwen3.5’s 45 DeltaNet layers and MoE gates. Attempting --lora-paths will fail at init_lora_shapes(). See sglang#9897.
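The supported/unsupported split can be checked mechanically. A sketch using the module names listed above (7 supported + 6 unsupported = the adapter's 13 target types; the expanded `in_proj_*` names are our reading of the "in_proj_a/b/z/qkv" shorthand):

```shell
# SGLang's runtime-LoRA module whitelist vs this adapter's target modules.
SUPPORTED="q_proj k_proj v_proj o_proj gate_proj up_proj down_proj"
ADAPTER_TARGETS="$SUPPORTED in_proj_a in_proj_b in_proj_z in_proj_qkv out_proj shared_expert_gate"

list_unsupported() {
    # Print each adapter target module that is not in the whitelist.
    for m in $ADAPTER_TARGETS; do
        case " $SUPPORTED " in
            *" $m "*) ;;          # supported: runtime LoRA could handle it
            *) echo "$m" ;;       # unsupported: forces the offline merge
        esac
    done
}

list_unsupported    # the 6 modules that rule out --lora-paths
```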

Step 1: Merge LoRA into Base Model (offline, ~25 min)

Use LLaMA-Factory to merge. Requires ~800 GB RAM. Output: 743 GB (122 × 5 GB shards).

yaml — merge config (examples/merge_lora/qwen35_397b_lora_sft_amdpilot.yaml)
### model
model_name_or_path: /sgl-workspace/models/hub/models--Qwen--Qwen3.5-397B-A17B/snapshots/7cad2bae11cb49ca79f7d6a0954de2e2756f4e27
adapter_name_or_path: JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4
template: qwen3_5_nothink
trust_remote_code: true

### export
export_dir: /sgl-workspace/models/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged
export_size: 5
export_device: cpu
export_legacy_format: false
bash — run merge
llamafactory-cli export examples/merge_lora/qwen35_397b_lora_sft_amdpilot.yaml
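Once the merge completes, a quick sanity check on the export directory helps before serving (a sketch; the 122-shard and ~743 GB figures come from the numbers above):

```shell
MERGED_DIR=${MERGED_DIR:-/sgl-workspace/models/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged}

count_shards() {
    # Count *.safetensors shard files directly inside a directory.
    find "$1" -maxdepth 1 -name '*.safetensors' | wc -l | tr -d ' '
}

if [ -d "$MERGED_DIR" ]; then
    echo "shards: $(count_shards "$MERGED_DIR") (expect 122)"
    du -sh "$MERGED_DIR"    # expect ~743 GB
fi
```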
Step 2: Serve Merged Model with SGLang

Identical to the base Qwen3.5 launch command, just swap --model-path and --served-model-name. All flags (--max-mamba-cache-size, --reasoning-parser, etc.) carry over.

bash — docker run
docker run -d \
    --name qwen35-sft-serve \
    --ipc=host --network=host --privileged \
    --shm-size 32G \
    --ulimit core=0:0 \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    -e HF_HUB_OFFLINE=1 \
    -e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
    -v /mnt/dcgpuval/huggingface:/sgl-workspace/models \
    sglang-test:v0.5.9-rocm700-mi35x-20260310 \
    python3 -m sglang.launch_server \
        --model-path /sgl-workspace/models/hub/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged \
        --served-model-name Qwen3.5-397B-A17B-SFT-v4 \
        --tp 8 \
        --trust-remote-code \
        --attention-backend triton \
        --mem-fraction-static 0.80 \
        --max-mamba-cache-size 128 \
        --reasoning-parser qwen3 \
        --tool-call-parser qwen3_coder \
        --watchdog-timeout 1200 \
        --host 0.0.0.0 \
        --port 30000
Step 3: Verify
bash
# Wait for server ready
docker logs -f qwen35-sft-serve

# Health check
curl http://localhost:30000/health

# Test inference with the fine-tuned model
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3.5-397B-A17B-SFT-v4",
      "messages": [{"role": "user", "content": "What is 2+2?"}],
      "max_tokens": 512,
      "temperature": 0.0
    }' | python3 -m json.tool
Adapter Details
HF Adapter
JinnP/Qwen3.5-397B-A17B-LoRA-SFT-v4
Rank / Alpha
32 / 64
Trainable Params
128.5M (0.032%)
Merged Size
743 GB (122 shards)
Dataset
270 examples (3-view)
Best Eval Loss
0.0547 (epoch 8)
Target Modules
13 types (all)
NFS Path
hub/Qwen3.5-397B-A17B-LoRA-SFT-v4-merged/

Kimi-K2.5

Mixture-of-Experts — 1T total params, 32B active per token, W4A16 quantized

MoE W4A16 555 GB
HF Path
moonshotai/Kimi-K2.5
Decode Throughput
42.6 tok/s optimized (TP8, BS1)
Attention Backend
triton decode + aiter prefill
Status
Verified ✓

BEST TP=8 Optimized Serving (Verified, 38.6% faster)

Achieves 23.5ms decode median (42.6 tok/s) vs 38.3ms baseline — a 38.6% improvement. Uses the stock ROCm 7.2 image with optimized sglang and aiter branches checked out inside the container. Three key optimizations: GEMM A16W16 small-M configs (BLOCK_SIZE_M=16–64 for M=1 decode), MoE Triton kernel configs (E=384, N=128 for MI355X), and hybrid attention (triton decode + aiter prefill).

Step 0: Pull Base Image
bash
docker pull rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317
Step 1: Start Container
bash — docker run
docker run -d --name kimi-k25-server \
    --device /dev/kfd --device /dev/dri \
    --shm-size 64g --network host \
    --cap-add SYS_PTRACE --group-add video \
    -e GPU_COREDUMP_ENABLE=0 \
    -e SGLANG_ROCM_FUSED_DECODE_MLA=0 \
    -e SGLANG_USE_AITER=1 \
    -v /path/to/huggingface:/root/.cache/huggingface \
    rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317 \
    sleep infinity

Replace /path/to/huggingface with the parent directory containing the hub/ folder with models--moonshotai--Kimi-K2.5.

Step 2a: Apply Optimized sglang Branch (MoE Triton configs)
bash
docker exec kimi-k25-server bash -c '
cd /sgl-workspace/sglang
git stash && git clean -fd
git remote add fork https://github.com/Arist12/sglang.git
git fetch fork kimi-k25-optimize-v2
git checkout fork/kimi-k25-optimize-v2
cd python && pip install -e .
'
Step 2b: Apply Optimized aiter Branch (GEMM A16W16 small-M configs)
bash
docker exec kimi-k25-server bash -c '
cd /sgl-workspace/aiter
git stash && git clean -fd
git remote add fork https://github.com/Arist12/aiter.git
git fetch fork kimi-k25-optimize-v2
git checkout fork/kimi-k25-optimize-v2
pip install .
'

Must use pip install . (non-editable) for aiter. Using pip install -e . creates a broken namespace package that fails to resolve compiled C extensions. Verify with: python3 -c "from aiter import dynamic_per_tensor_quant; print('OK')"

Step 3: Launch Server
bash
docker exec -d kimi-k25-server bash -c '
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export SGLANG_USE_AITER=1
/opt/venv/bin/python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2.5 \
    --tp 8 \
    --trust-remote-code \
    --decode-attention-backend triton \
    --prefill-attention-backend aiter \
    --mem-fraction-static 0.85 \
    --reasoning-parser kimi_k2 \
    --tool-call-parser kimi_k2 \
    --host 0.0.0.0 --port 30000 \
    > /tmp/server.log 2>&1
'

Key flags:
--decode-attention-backend triton + --prefill-attention-backend aiter: hybrid attention (triton for decode, aiter ASM kernels for prefill)
SGLANG_ROCM_FUSED_DECODE_MLA=0: required (the image default is 1)
SGLANG_USE_AITER=1: enables the aiter prefill path
--mem-fraction-static 0.85: maximizes KV cache
--reasoning-parser kimi_k2: separates reasoning from content
--tool-call-parser kimi_k2: enables structured tool calling
CUDA graph: enabled by default

What the optimized branches change:
sglang (Arist12/sglang:kimi-k25-optimize-v2): MoE Triton kernel configs for E=384, N=128 on MI355X, with BLOCK_SIZE_M=16 for batch=1 decode.
aiter (Arist12/aiter:kimi-k25-optimize-v2): GEMM A16W16 small-M configs (M_LEQ_4/8/16/32/64 with BLOCK_SIZE_M=16–64; the default was 256).

BENCH Optimized Benchmark Results (TP=8, batch=1)

bench_one_batch (raw decode latency, input=8192, output=2048)

Configuration                              Decode Median           Prefill                  Improvement
Baseline (triton attn, default configs)    38.3 ms                 –                        –
+ AITER prefill attention                  34.4 ms                 –                        10.2%
+ GEMM A16W16 small-M tuning               24.3 ms                 –                        36.6%
+ MoE Triton config tuning (final)         23.5 ms (42.6 tok/s)    637 ms (12,847 tok/s)    38.6%

bench_one_batch_server (server throughput, output=2048)

Input     Output    TTFT      Decode          Total Latency
1,024     2,048     0.27 s    45.19 tok/s     45.32 s
2,048     2,048     0.33 s    44.67 tok/s     45.84 s

Image: rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317 + optimized branches • 8× MI355X • triton decode + aiter prefill • CUDA graph on • March 2026

Previous Results: Grid Search (stock images, no optimized branches)

18 configs tested across 2 ROCm images with stock sglang/aiter. These results are superseded by the optimized config above.

rocm700 (v0.5.9-rocm700-mi35x-20260319)

Input     Output    TTFT      Decode          Total Latency
1,024     512       3.66 s    33.62 tok/s     18.86 s
1,024     1,024     1.54 s    35.54 tok/s     16.57 s
8,192     512       5.58 s    23.36 tok/s     27.41 s
8,192     1,024     5.57 s    23.31 tok/s     43.93 s

rocm720 (v0.5.9-rocm720-mi35x-20260320)

Input     Output    TTFT      Decode          Total Latency
1,024     512       1.40 s    26.47 tok/s     20.70 s
1,024     1,024     0.84 s    26.34 tok/s     39.67 s
8,192     512       4.17 s    19.48 tok/s     30.40 s
8,192     1,024     4.20 s    19.37 tok/s     21.95 s

ADDON Eagle3 Speculative Decoding

1.8× decode speedup with lightseekorg/kimi-k2.5-eagle3 (3B, 1 layer, BF16). Supports greedy and non-greedy (temp>0) on ROCm.

Eagle3 1.8×
bash — Eagle3 launch (add --speculative flags to the optimized server command)
# After applying the optimized branches (Steps 2a/2b above), launch with Eagle3:
docker exec -d kimi-k25-server bash -c '
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export SGLANG_USE_AITER=1
/opt/venv/bin/python3 -m sglang.launch_server \
    --model-path moonshotai/Kimi-K2.5 \
    --tp 8 --trust-remote-code \
    --decode-attention-backend triton \
    --prefill-attention-backend aiter \
    --mem-fraction-static 0.75 \
    --reasoning-parser kimi_k2 \
    --tool-call-parser kimi_k2 \
    --host 0.0.0.0 --port 30000 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
    > /tmp/server.log 2>&1
'

Must use --speculative-algorithm EAGLE3 (not EAGLE). Using EAGLE silently degrades accept_length to 1.0 because it skips the 3-layer aux hidden state capture that Eagle3 requires. See PR #21272 for auto-detection fix. Note --mem-fraction-static 0.75 (reduced from 0.85) to leave VRAM for the draft model.

Eagle3 Decode Throughput (TP=8, BS=1, math/coding tasks)

Task                Greedy (temp=0)    Non-greedy (temp=0.6)    vs Baseline
math_algebra        69.8 tok/s         66.2 tok/s               1.9×
math_series         69.3 tok/s         71.7 tok/s               2.0×
coding_fibonacci    66.6 tok/s         57.7 tok/s               1.6×
coding_sort         66.3 tok/s         56.0 tok/s               1.6×
coding_json         58.6 tok/s         55.8 tok/s               1.6×
AVERAGE             66.1 tok/s         61.5 tok/s               1.8× / 1.7×

Accept length: greedy 2.3–3.6, non-greedy 3.0–3.4 • Baseline: 35.8 tok/s (rocm700 stock) • --mem-fraction-static 0.75 required for Eagle3

Non-greedy on ROCm: Requires PyTorch fallback kernels for 3 missing sgl_kernel C++ ops. Apply via patch_eagle_rocm.py from this repo, or use PR #21275 once merged. Without the patch, temp>0 falls back to greedy silently.

Known Limitation: Long Context

Eagle3 accept_length degrades at 8K+ input tokens (falls to 1.6–2.0), making it slower than baseline. This is a draft model quality issue, not a serving bug. See #6783. Use Eagle3 for short-context, high-throughput workloads (coding, math, chat).

Verification

bash
# Monitor startup logs (wait for "The server is fired up and ready to roll!")
docker exec kimi-k25-server tail -f /tmp/server.log

# Health check
curl http://localhost:30000/health

# Chat completion (note reasoning_content field from --reasoning-parser kimi_k2)
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "moonshotai/Kimi-K2.5",
      "messages": [{"role": "user", "content": "Hello!"}],
      "max_tokens": 100
    }' | python3 -m json.tool

# Tool calling test (note tool_calls array from --tool-call-parser kimi_k2)
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "moonshotai/Kimi-K2.5",
      "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
      "tools": [{"type":"function","function":{"name":"get_weather","description":"Get weather for a location","parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
      "max_tokens": 200
    }' | python3 -m json.tool

GLM-5-FP8

MoE + Native Sparse Attention (NSA) — 744B total params, 40B active, DeepSeek-V2 architecture

MoE NSA FP8 705 GB
HF Path
zai-org/GLM-5-FP8
Model Type
glm_moe_dsa
NSA Backend
tilelang
Day-0 PR

Required: glm_moe_dsa Config Registration Patch

GLM-5 uses a glm_moe_dsa model type that HuggingFace Transformers doesn’t recognize natively. This is registered in SGLang’s config loader (included in recent SGLang builds). Ensure your SGLang version includes the fix from PR #18911.

bash
# Ensure latest transformers (for GLM-5 tokenizer support)
pip install --upgrade transformers

TP=8 Launch Server (Recommended)

bash
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --served-model-name glm-5-fp8 \
    --tp 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.80 \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang \
    --chunked-prefill-size 131072 \
    --watchdog-timeout 1200 \
    --port 30000

TP=4 Launch Server (4 GPUs)

Tight fit: 705 GB model in 1,152 GB. Reduce --mem-fraction-static to leave room for KV cache. TP=8 strongly recommended for GLM-5.
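The arithmetic behind the tight fit, as a quick check (a sketch; real usage also includes activations and CUDA-graph buffers, which the reduced --mem-fraction-static has to leave room for):

```shell
# Per-GPU budget at TP=4: 705 GB of FP8 weights split across 4x 288 GB GPUs.
awk 'BEGIN {
    hbm = 288; weights = 705 / 4
    printf "weights/GPU: %.2f GB, headroom: %.2f GB of %d GB\n", weights, hbm - weights, hbm
}'
```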

bash
HIP_VISIBLE_DEVICES=0,1,2,3 python3 -m sglang.launch_server \
    --model-path zai-org/GLM-5-FP8 \
    --served-model-name glm-5-fp8 \
    --tp 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.60 \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang \
    --chunked-prefill-size 131072 \
    --disable-cuda-graph \
    --watchdog-timeout 1200 \
    --port 30000

Verification

bash
# Health check
curl http://localhost:30000/health

# List models
curl http://localhost:30000/v1/models | python3 -m json.tool

# Quick inference test
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "glm-5-fp8",
      "messages": [{"role": "user", "content": "Hello, who are you?"}],
      "max_tokens": 64
    }' | python3 -m json.tool

Benchmarking

Use bench_one_batch_server to measure single-batch latency including HTTP and scheduler overhead. Connect to a running server.

Test Matrix

Input Tokens    Output Tokens    Batch Size    Measures
1,024           512              1             Prefill + decode latency
1,024           1,024            1             Short-context long-generation
8,192           512              1             Medium-context prefill stress
8,192           1,024            1             Medium-context full pipeline
16,384          512              1             Long-context prefill stress
16,384          1,024            1             Long-context full pipeline

Run Full Benchmark Suite

Iterates through all input/output length combinations. Assumes server is running on port 30000.

bash — full benchmark sweep
for INPUT_LEN in 1024 8192 16384; do
  for OUTPUT_LEN in 512 1024; do
    echo "====== Input: ${INPUT_LEN}, Output: ${OUTPUT_LEN} ======"
    python3 -m sglang.bench_one_batch_server \
        --model None \
        --base-url http://localhost:30000 \
        --batch-size 1 \
        --input-len $INPUT_LEN \
        --output-len $OUTPUT_LEN
    echo ""
  done
done

Individual Commands

Pick and run a specific combination.

1K in → 512 out
python3 -m sglang.bench_one_batch_server \
    --model None \
    --base-url http://localhost:30000 \
    --batch-size 1 \
    --input-len 1024 \
    --output-len 512
1K in → 1K out
python3 -m sglang.bench_one_batch_server \
    --model None \
    --base-url http://localhost:30000 \
    --batch-size 1 \
    --input-len 1024 \
    --output-len 1024
8K in → 512 out
python3 -m sglang.bench_one_batch_server \
    --model None \
    --base-url http://localhost:30000 \
    --batch-size 1 \
    --input-len 8192 \
    --output-len 512
8K in → 1K out
python3 -m sglang.bench_one_batch_server \
    --model None \
    --base-url http://localhost:30000 \
    --batch-size 1 \
    --input-len 8192 \
    --output-len 1024
16K in → 512 out
python3 -m sglang.bench_one_batch_server \
    --model None \
    --base-url http://localhost:30000 \
    --batch-size 1 \
    --input-len 16384 \
    --output-len 512
16K in → 1K out
python3 -m sglang.bench_one_batch_server \
    --model None \
    --base-url http://localhost:30000 \
    --batch-size 1 \
    --input-len 16384 \
    --output-len 1024

Online Serving Benchmark (Optional)

For more realistic multi-client throughput testing, use bench_serving instead.

bash
CON="16 32 64 128"
ISL=3200
OSL=800
for con in $CON; do
    PROMPTS=$(($con * 5))
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-input-len $ISL \
        --random-output-len $OSL \
        --num-prompts $PROMPTS \
        --random-range-ratio 1.0 \
        --max-concurrency $con
done

References