All numbers in this report come from a single validation session on hjbog-srdc-2.amd.com, docker image clementlincf/amdafde:v0.5.10rc0-rocm720-mi30x-kimi-k2.5-opt-20260420. Logs are archived under progress/kimi-k2.5/container-validation-0420/.
- kimi-K2.5-W8A8-dev-rebased
- kimi-K2.5-W4A8-rebased, HEAD b5757d6 (AITER)
- feature/w4a8-moe-port-rebased, HEAD a1d8312 (FlyDSL)
- Kimi-K2.5-W4A8 (497 GB)
- kimi-k2.5-eagle3 (6.0 GB)
- launch_eagle3.sh: EAGLE3 only, no env-var knobs enabled
- launch_eagle3_opt.sh: identical args plus two env vars activating the patched code paths

… aiter/FlyDSL and (b) two environment variables that activate code paths those commits added. No tokenizer, schedule, or sampling parameters changed.
Five logically independent changes across three commits in two repositories. Each patch is described in its own section below with the measured kernel-level effect.
| # | Patch | Repo · Commit | Scope | Activation |
|---|---|---|---|---|
| 1 | MoE Stage1 scheduler-hints gate | FlyDSL · a1d8312 | kernels/moe_gemm_2stage.py | FLYDSL_MOE_STAGE1_SCHED=0 |
| 2 | Stage1 auto tile_k=256 for decode | AITER · 511df6a | aiter/fused_moe.py | Automatic when block_m≤16 |
| 3 | Stage2 tile_k=256 override | AITER · 511df6a | aiter/fused_moe.py | AITER_FLYDSL_STAGE2_TILE_K=256 |
| 4 | MI308X bf16 GEMM tune (+65 entries) | AITER · b5757d6 | kimik2_bf16_tuned_gemm.csv | Automatic via cu_num=80 lookup |
| 5 | FMHA v3 bf16 rounding rtna → rtz | AITER · b5757d6 | aiter/ops/mha.py | Default changed (no env var needed) |
Commit 511df6a also adds AITER_FLYDSL_MOE_GRID_TRIM (default on) and a JSONL shape logger. These are infrastructure knobs; grid trimming has a measurable effect only for M≤8 (outside this EAGLE3 verify shape) and is not claimed as a contributor to the 17.2% TPOT reduction.
FlyDSL's MoE Stage1 kernel ships with manual sched_barrier / sched_mfma / sched_dsrd / sched_vmem / sched_dswr hints intended to steer LLVM's instruction scheduler. On the W4A8 decode path, disabling these hints and letting LLVM's default scheduler run produces a shorter kernel.
The patch adds two environment gates, FLYDSL_MOE_STAGE1_SCHED and FLYDSL_MOE_STAGE2_SCHED, both on by default. Setting FLYDSL_MOE_STAGE1_SCHED=0 skips all hand-written sched_* hints at compile time. Stage2's hints stay on by default: they are tuned per tile_m and were approximately neutral in measurements.
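The gate described above can be sketched as a default-on environment flag. This is a minimal illustration, not FlyDSL's actual parsing code: the helper names `env_flag` and `stage1_hints_enabled` and the set of accepted "false" spellings are assumptions.

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean knob; unset or empty means 'use the default'."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    # Assumed convention: only these spellings turn the flag off.
    return raw.strip().lower() not in ("0", "false", "off", "no")


def stage1_hints_enabled() -> bool:
    """Hypothetical wrapper: emit sched_* hints unless the gate is set to 0."""
    return env_flag("FLYDSL_MOE_STAGE1_SCHED", default=True)


if __name__ == "__main__":
    print("Stage1 sched hints enabled:", stage1_hints_enabled())
```

With the variable unset the hints stay on, which matches the default-on behaviour the report describes; exporting FLYDSL_MOE_STAGE1_SCHED=0 turns them off.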
| Metric | Sched ON (default) | Sched OFF | Δ |
|---|---|---|---|
| Total instructions | 4 833 | 4 590 | −5.0% |
| MFMA | 448 | 448 | 0 |
| buffer_load | 293 | 293 | 0 |
| VALU | 3 335 | 3 335 | 0 |
| s_waitcnt | 399 | 149 | −62.7% |
| s_barrier | 56 | 56 | 0 |
| ds_write | 56 | 56 | 0 |
The MFMA / memory / LDS footprint is unchanged. The reduction is concentrated in s_waitcnt (memory-fence instructions): 399 → 149, a drop of 250 waitcnts. The manual hints constrain LLVM's vmcnt tracking in a way that causes conservative extra fences; removing them allows a tighter schedule with fewer stalls.
| Configuration | Stage1 (μs, M=40) | Δ |
|---|---|---|
| Baseline (sched ON, tile_k=128) | 242 | — |
| Sched OFF, tile_k=128 | 206 | −14.9% |
Patch 2, Stage1 tile_k=256 for decode (AITER). Stage1's K-loop count is K / tile_k. For Kimi's int4 W4A8 Stage1 (K=7168), raising tile_k from 128 to 256 halves the number of K-iterations, from 56 to 28. The patch auto-picks tile_k=256 when block_m≤16 (the decode regime) and leaves prefill at 128, where 256 was measured to regress.

    # auto-pick tile_k=256 for W4A8 decode (block_m ≤ 16), keep 128 for prefill
    if _is_w_int4 and block_m <= 16:
        stage1_tile_k = 256
    else:
        stage1_tile_k = 128

Halving the K-loop count also halves the loop-local barrier and s_waitcnt count. The patch additionally exposes AITER_FLYDSL_STAGE1_TILE_K as a manual override for experimentation.
At decode shapes (small M), the kernel is memory-bound and VGPR pressure is low; the larger tile_k keeps occupancy stable and reduces per-iteration overhead. At prefill shapes (large M), the wider B-tile increases VGPR pressure enough to drop a wave per SIMD, and the per-iteration overhead is a smaller fraction of the compute cost. Measurements confirmed prefill regresses with tile_k=256.
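The relationship between tile_k and K-loop length can be checked with a few lines. This is a sketch under stated assumptions: `stage1_tile_k` mirrors the selection rule quoted above, while `k_iterations` and its ceiling division are illustrative helpers, not AITER functions.

```python
def stage1_tile_k(is_w_int4: bool, block_m: int) -> int:
    """Selection rule from the patch: 256 for W4A8 decode, 128 otherwise."""
    return 256 if is_w_int4 and block_m <= 16 else 128


def k_iterations(k: int, tile_k: int) -> int:
    """Number of K-loop iterations, rounding up for a partial last tile."""
    return -(-k // tile_k)


if __name__ == "__main__":
    K_STAGE1 = 7168  # Stage1 K for Kimi W4A8, from the text above
    for block_m in (16, 32):
        tk = stage1_tile_k(True, block_m)
        print(f"block_m={block_m}: tile_k={tk}, iterations={k_iterations(K_STAGE1, tk)}")
```

For K=7168 this gives 28 iterations at tile_k=256 and 56 at tile_k=128, so every per-iteration barrier and wait is paid half as often in the decode regime.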
| Configuration | Stage1 (μs) | Δ from baseline |
|---|---|---|
| Baseline (sched ON, tile_k=128) | 242 | — |
| Sched OFF, tile_k=128 | 206 | −14.9% |
| Sched OFF + auto tile_k=256 | 193 | −20.2% |
Patch 3, Stage2 tile_k=256 override (AITER). Stage2's K dimension equals the MoE intermediate dimension, which for Kimi-K2.5 is inter_dim=256. The default tile_k=64 therefore runs 4 K-iterations per tile. Setting tile_k=256 collapses the K-loop to a single step, eliminating three loop-closing barriers and their associated s_waitcnts per tile.

    # in flydsl_moe_stage2 dispatch
    _over_tk = os.environ.get('AITER_FLYDSL_STAGE2_TILE_K', '')
    if _over_tk:
        tile_k = int(_over_tk)

| tile_m | tile_n | tile_k | Time (μs) | Δ | Correctness |
|---|---|---|---|---|---|
| default | 256 | 64 (default) | 833.7 | baseline | ✓ |
| 32 | 256 | 64 | 834.1 | 0% | ✓ |
| — | 128 | 64 | 838.5 | +0.6% | ✓ |
| — | 256 | 128 | 813.3 | −2.5% | ✓ |
| — | 256 | 256 | 790.2 | −5.2% | ✓ |
| 32 | 256 | 256 | 789.9 | −5.3% | ✓ |
| 64 | 256 | 256 | 860.5 | +3.2% | ✗ wrong output |
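The sweep above only makes sense if incorrect rows are filtered out before ranking by time. The following sketch shows that selection on the table's own numbers; `SweepRow` and `best_config` are illustrative names, and treating both "default" and unset tile_m as `None` is an assumption about what the table means.

```python
from typing import List, NamedTuple, Optional


class SweepRow(NamedTuple):
    tile_m: Optional[int]  # None = tile_m not pinned (default or unset in the table)
    tile_n: int
    tile_k: int
    time_us: float
    correct: bool


# Rows copied from the Stage2 sweep table.
SWEEP: List[SweepRow] = [
    SweepRow(None, 256, 64, 833.7, True),   # baseline
    SweepRow(32, 256, 64, 834.1, True),
    SweepRow(None, 128, 64, 838.5, True),
    SweepRow(None, 256, 128, 813.3, True),
    SweepRow(None, 256, 256, 790.2, True),
    SweepRow(32, 256, 256, 789.9, True),
    SweepRow(64, 256, 256, 860.5, False),   # wrong output: must never be selected
]


def best_config(rows: List[SweepRow]) -> SweepRow:
    """Fastest row among those that passed the correctness check."""
    valid = [r for r in rows if r.correct]
    if not valid:
        raise ValueError("no correct configuration in sweep")
    return min(valid, key=lambda r: r.time_us)


best = best_config(SWEEP)
baseline = SWEEP[0]
delta_pct = (best.time_us - baseline.time_us) / baseline.time_us * 100
print(best, f"{delta_pct:+.1f}%")
```

The two correct tile_k=256 rows differ by only 0.3 μs, so on this data the gain comes from tile_k rather than from pinning tile_m, which is consistent with exposing tile_k alone as the override.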
| M | default (μs) | tile_k=256 (μs) | Δ |
|---|---|---|---|
| 40 | 370 | 368 | noise |
| 80 | 477 | 465 | −2.4% |
| 128 | 527 | 515 | −2.3% |
| 160 | 835 | 792 | −5.2% |
| 192 | 844 | 800 | −5.2% |
| 256 | 877 | 830 | −5.3% |
The EAGLE3 verify step processes concurrency × (num_draft_tokens + 1) tokens per forward pass. With conc=40 and num_draft_tokens=4, this lands around M=160–192. Those shapes sit at the tile_k boundary where Stage2's 4-iteration K-loop is a significant overhead; decode-only shapes (M=40) are small enough that the overhead is already negligible.
AITER's tuned GEMM configuration file kimik2_bf16_tuned_gemm.csv shipped with entries keyed to cu_num=256 (MI300X). On MI308X (cu_num=80), every lookup missed, so every MLA projection GEMM fell back to the generic torch.matmul path (rocBLAS Tensile). The patch adds 65 entries covering the verify-step shapes that AITER's hand-written ASM GEMM wins on.
The new entries cover M ∈ {72…256}, N ∈ {2112, 3072, 3584, 4608, 7168, 14336}, and K ∈ {7168, 14336}. The winning kernels are bf16gemm_fp32bf16_tn_*x64_pf3_splitk and *_splitk_clean (AITER hand-written ASM).
| N | K | rocBLAS (μs) | AITER ASM kernel | AITER (μs) | Speedup |
|---|---|---|---|---|---|
| 2112 | 7168 | 181.6 | 96x64_pf3_splitk sk=1 | 77.5 | 2.3× |
| 4608 | 7168 | 495.2 | 96x64_pf3_splitk sk=1 | 143.9 | 3.4× |
| 3072 | 7168 | 163.2 | 64x64_pf3_splitk sk=1 | 98.5 | 1.7× |
| 3072 | 14336 | 204.4 | 64x64_pf3_splitk sk=1 | 185.5 | 1.1× |
MLA layers in each decoder block issue four bf16 GEMMs (q_a_proj, q_b_proj, kv_a_proj_with_mqa, kv_b_proj / o_proj). With 60 layers × 4 GEMMs per forward pass and no cu_num=80 match, all 240 GEMMs per step went through rocBLAS. The added entries target the M range the EAGLE3 verify step actually produces (M=160–256), so the fast-path is hit at every verify invocation.
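The lookup miss described above can be illustrated with a dictionary keyed on the compute-unit count. This is a sketch only: the key layout `(cu_num, M, N, K)`, the function `pick_gemm`, the placeholder kernel name for the MI300X row, and the choice of M=160 are all assumptions for illustration and do not reflect the real CSV schema.

```python
from typing import Dict, Tuple

GemmKey = Tuple[int, int, int, int]  # (cu_num, M, N, K); assumed key layout
FALLBACK = "torch.matmul"


def pick_gemm(table: Dict[GemmKey, str], cu_num: int, m: int, n: int, k: int) -> str:
    """Return the tuned kernel for this shape, or the generic fallback on a miss."""
    return table.get((cu_num, m, n, k), FALLBACK)


# Illustrative entries only; M=160 is a hypothetical shape inside the tuned range.
shipped = {(256, 160, 2112, 7168): "some_mi300x_kernel"}  # placeholder name
patched = dict(shipped)
patched[(80, 160, 2112, 7168)] = "96x64_pf3_splitk"

print(pick_gemm(shipped, 80, 160, 2112, 7168))  # miss: no cu_num=80 row
print(pick_gemm(patched, 80, 160, 2112, 7168))  # hit after the patch

LAYERS, GEMMS_PER_LAYER = 60, 4
print(LAYERS * GEMMS_PER_LAYER, "GEMMs per forward pass affected")
```

Because the compute-unit count is part of the key, a table tuned for one part never matches on another, no matter how many shapes it covers; the fix has to add rows for cu_num=80 rather than adjust existing ones.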
AITER's FMHA v3 (asm attention) exposes three bf16 rounding modes for the fp32→bf16 conversion in the attention epilogue: rtne (0), rtna (1, default), and rtz (2). On gfx942, rtz maps to the native single-instruction v_cvt_pk_rtz_bf16_f32; rtna is emulated with multiple instructions. The patch flips the default from 1 to 2 at nine API sites in aiter/ops/mha.py.

    # aiter/ops/mha.py, 9 call sites
    - how_v3_bf16_cvt: Optional[int] = 1  # rtna
    + how_v3_bf16_cvt: Optional[int] = 2  # rtz

The value is threaded through Python → C++ struct mha_fwd_args.how_v3_bf16_cvt → CSV lookup in hsa/gfx942/fmha_v3_fwd/fmha_fwd.csv keyed on (dtype, hdim_q, hdim_v, mask, mode, bf16_cvt). The matching row names a .co (HSA code object) which AiterAsmKernel loads. So the change selects a different pre-compiled ASM binary at every call.
| Rounding mode | Code object | Size (bytes) |
|---|---|---|
| rtne (0) | fwd_hd128_bf16_rtne.co | 28 720 |
| rtna (1, old default) | fwd_hd128_bf16_rtna.co | 27 120 |
| rtz (2, new default) | fwd_hd128_bf16_rtz.co | 23 272 |
csrc/cpp_itfs/mha_bwd.cu contains a clamp:

    // rtna & rtz are deprecated in gfx950
    if (get_gfx() == "gfx950" && how_v3_bf16_cvt != 0)
        how_v3_bf16_cvt = 0;  // force back to rtne

The default flip therefore only affects gfx942 targets; gfx950 behaviour is unchanged.
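The combined effect of the new default and the gfx950 clamp can be modelled in a few lines. This is a Python paraphrase for illustration, not AITER code: `effective_bf16_cvt` restates the C++ clamp quoted above, and the dictionary simply copies the hd128 file names from the code-object table.

```python
RTNE, RTNA, RTZ = 0, 1, 2

# hd128 bf16 forward code objects on gfx942, copied from the table above.
GFX942_HD128_CODE_OBJECTS = {
    RTNE: "fwd_hd128_bf16_rtne.co",
    RTNA: "fwd_hd128_bf16_rtna.co",
    RTZ: "fwd_hd128_bf16_rtz.co",
}


def effective_bf16_cvt(gfx: str, requested: int) -> int:
    """Mirror of the clamp: gfx950 always falls back to rtne."""
    if gfx == "gfx950" and requested != RTNE:
        return RTNE
    return requested


if __name__ == "__main__":
    for gfx in ("gfx942", "gfx950"):
        for default in (RTNA, RTZ):  # old default, new default
            print(gfx, "requested", default, "-> effective", effective_bf16_cvt(gfx, default))
    print(GFX942_HD128_CODE_OBJECTS[effective_bf16_cvt("gfx942", RTZ)])
```

On gfx950 both the old and the new default resolve to mode 0, so only gfx942 sees a different code object after the patch.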
rtz (truncation) has a larger single-op expected error than rtna (round-to-nearest), but the difference is within the 7-bit bf16 mantissa noise floor. End-to-end GSM8K accuracy was checked after the change (see Accuracy) and stayed within the ±0.66% stderr band.
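The difference between the three rounding modes is easiest to see on the raw fp32 bit pattern, since bf16 keeps only the upper 16 bits. The sketch below is a software model for finite inputs only (no NaN handling) and is not how the ASM kernels implement the conversion; the function names are illustrative.

```python
import struct


def _f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bf16_rtz(x: float) -> float:
    """Round toward zero: drop the low 16 bits."""
    return _bits_f32(_f32_bits(x) & 0xFFFF0000)


def to_bf16_rtna(x: float) -> float:
    """Round to nearest, ties away from zero: add half an ulp, then truncate."""
    return _bits_f32((_f32_bits(x) + 0x8000) & 0xFFFF0000)


def to_bf16_rtne(x: float) -> float:
    """Round to nearest, ties to even: bias depends on the kept LSB."""
    bits = _f32_bits(x)
    return _bits_f32((bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000)


if __name__ == "__main__":
    x = 1.0 + 2.0 ** -8  # exactly halfway between two bf16 values
    print(to_bf16_rtz(x), to_bf16_rtna(x), to_bf16_rtne(x))
```

For the halfway case shown, rtz and rtne both return 1.0 while rtna returns 1.0078125. The worst-case relative error of truncation is just under 2^-7 versus 2^-8 for either round-to-nearest mode, which is the one-bit difference the paragraph above refers to.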
Three runs from the same image on the same host: baseline W4A8 (no EAGLE3, no opt), EAGLE3 only, EAGLE3 + opt patches. Workload: 160 random prompts, concurrency 40, input 10240, output 512.
| Metric | Baseline (W4A8) | EAGLE3 | EAGLE3 + opt | Δ (EAGLE3 → opt) |
|---|---|---|---|---|
| Duration (s) | 288.20 | 269.20 | 253.27 | −5.9% |
| Total throughput (tok/s) | 5 969 | 6 391 | 6 793 | +6.3% |
| Output throughput (tok/s) | 284.25 | 304.31 | 323.45 | +6.3% |
| TPOT median (ms) | 93.00 | 90.16 | 74.66 | −17.2% |
| TPOT mean (ms) | 93.29 | 92.38 | 75.72 | −18.0% |
| TPOT P99 (ms) | 135.76 | 187.98 | 128.15 | −31.8% |
| ITL median (ms) | 50.81 | 34.23 | 29.05 | −15.1% |
| E2E median (s) | 71.79 | 65.83 | 62.28 | −5.4% |
| TTFT median (s) | 24.18 | 18.65 | 23.96 | +28.5% |
| Accept length | — | 3.93 | 3.93 | 0 |
Prefill shapes (large M) are outside the bf16 GEMM tune's coverage, and tile_k=256 regresses there. Under constant concurrency, faster decode also queues requests for prefill more quickly, which raises prefill queue depth and inflates TTFT. The 160-prompt, concurrency-40 workload is decode-bound, so TPOT and throughput improve even though TTFT moves in the opposite direction.
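The throughput and delta columns in the benchmark table can be cross-checked from the workload definition. The sketch assumes every request consumes exactly 10240 input tokens and generates exactly 512 output tokens, and that total throughput counts input plus output tokens; neither assumption is stated in the report, but both reproduce the table to within rounding.

```python
NUM_PROMPTS = 160
INPUT_LEN = 10240
OUTPUT_LEN = 512


def output_tps(duration_s: float) -> float:
    """Output tokens per second, assuming every request emits OUTPUT_LEN tokens."""
    return NUM_PROMPTS * OUTPUT_LEN / duration_s


def total_tps(duration_s: float) -> float:
    """Input + output tokens per second under the same assumption."""
    return NUM_PROMPTS * (INPUT_LEN + OUTPUT_LEN) / duration_s


def pct_delta(old: float, new: float) -> float:
    return (new - old) / old * 100


if __name__ == "__main__":
    for name, duration in (("baseline", 288.20), ("EAGLE3", 269.20), ("EAGLE3 + opt", 253.27)):
        print(f"{name}: total {total_tps(duration):.0f} tok/s, output {output_tps(duration):.2f} tok/s")
    print(f"TPOT median delta: {pct_delta(90.16, 74.66):+.1f}%")
    print(f"TTFT median delta: {pct_delta(18.65, 23.96):+.1f}%")
```

Both throughput rows are functions of duration alone under these assumptions, which is why they share the same +6.3% delta.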
Per-patch deltas are aggregated from the 0418 session (same workload, different host). Absolute TPOT differs between host sessions (80.96 ms on that run vs 90.16 ms on this one) due to cluster variability; percentage contributions are consistent.
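GSM8K (10-shot, concurrency 256) was run at each stage to verify that the kernel changes do not degrade generation quality. EAGLE3 and EAGLE3 + opt differ by at most 0.23 points, well inside the ±0.65–0.67% stderr of either run. The baseline sits 0.60–0.83 points below both EAGLE3 runs, which in the strict-match case exceeds a single run's stderr (±0.65–0.70%) but stays within one combined standard error of the difference (about 0.95 points).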
| Configuration | Strict-match | stderr | Flexible-extract | stderr |
|---|---|---|---|---|
| Baseline (W4A8, no EAGLE3) | 0.9310 | ±0.0070 | 0.9318 | ±0.0069 |
| EAGLE3 | 0.9393 | ±0.0066 | 0.9401 | ±0.0065 |
| EAGLE3 + opt | 0.9386 | ±0.0066 | 0.9378 | ±0.0067 |
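One way to compare the rows of the accuracy table is to divide each difference by the combined standard error of the two runs. The sketch below does that for the strict-match column; it treats the runs as independent samples, which is a simplifying assumption since all three configurations are scored on the same GSM8K questions.

```python
import math


def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """Difference in accuracy divided by the combined standard error."""
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Strict-match accuracy and stderr, copied from the table above.
STRICT = {
    "baseline": (0.9310, 0.0070),
    "eagle3": (0.9393, 0.0066),
    "eagle3_opt": (0.9386, 0.0066),
}

if __name__ == "__main__":
    pairs = (("eagle3", "baseline"), ("eagle3_opt", "eagle3"), ("eagle3_opt", "baseline"))
    for a, b in pairs:
        z = z_score(*STRICT[a], *STRICT[b])
        print(f"{a} vs {b}: z = {z:+.2f}")
```

All three pairwise scores come out below 1 in magnitude, so none of the differences is distinguishable from sampling noise at this test-set size.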
All numbers in this report are reproducible from the published image. The container entrypoint scripts are archived under progress/kimi-k2.5/container-validation-0420/.

    # 1. Pull and create container
    docker run -d --name mycontainer \
      --network host --cap-add=CAP_IPC_LOCK --cap-add=CAP_SYS_NICE \
      --security-opt seccomp=unconfined --shm-size=128g \
      --device /dev/kfd --device /dev/dri \
      -v /mnt/md0/models:/mnt/md0/models -w /sgl-workspace \
      clementlincf/amdafde:v0.5.10rc0-rocm720-mi30x-kimi-k2.5-opt-20260420 \
      sleep infinity

    # 2. Launch optimized server (env vars baked into script)
    docker exec -d mycontainer bash -c "cd /opt/scripts && ./launch_eagle3_opt.sh"

    # 3. Wait ~6 min for ready, then run serving bench
    docker exec mycontainer /opt/scripts/bench_client.sh /tmp/bench.log

    # 4. (Optional) GSM8K accuracy check
    docker exec mycontainer /opt/scripts/gsm8k_eval.sh /tmp/gsm8k.log

The optimized launch script sets the following environment:

    # Source-level patches baked into /opt/aiter and /opt/FlyDSL HEADs.
    # Runtime knobs below activate the code paths those patches added.
    export FLYDSL_MOE_STAGE1_SCHED=0       # Patch 1: disable stage1 hand-sched hints
    export AITER_FLYDSL_STAGE2_TILE_K=256  # Patch 3: stage2 single-step K-loop
    # Patches 2, 4, 5 auto-activate (block_m check, cu_num=80 CSV lookup, new default).

| Component | Path (in container) |
|---|---|
| Launch scripts | /opt/scripts/launch_{baseline,eagle3,eagle3_opt}.sh |
| Bench client | /opt/scripts/bench_client.sh |
| AITER MoE dispatch | /opt/aiter/aiter/fused_moe.py |
| AITER FMHA Python API | /opt/aiter/aiter/ops/mha.py |
| AITER bf16 GEMM CSV | /opt/aiter/aiter/configs/model_configs/kimik2_bf16_tuned_gemm.csv |
| FlyDSL MoE kernel | /opt/FlyDSL/kernels/moe_gemm_2stage.py |
| FMHA ASM code objects | /opt/aiter/hsa/gfx942/fmha_v3_{fwd,bwd}/MI308/ |