Qwen3 Coder Next llama.cpp Graph Optimization — Up to 38% Inference Speedup
ggerganov restructures the llama.cpp compute graph to achieve up to 38% inference speedup for the Qwen3 Coder Next 80B model. Detailed benchmark analysis and technical breakdown.
Overview
llama.cpp core developer ggerganov has published PR #19375, optimizing the compute graph for the Qwen3 Coder Next model. By eliminating unnecessary tensor copy operations and restructuring inference paths at the graph level, the change achieves up to a 37% speedup on M2 Ultra and up to 38% on DGX Spark. The PR has drawn over 177 upvotes on Reddit's r/LocalLLaMA; let's examine the core of the optimization.
Core Idea: Graph-Level Optimization
The core idea is straightforward: remove unnecessary tensor copy operations from the ggml compute graph.
Qwen3 Coder Next uses a MoE (Mixture of Experts) architecture where routers select which experts to activate and combine their outputs. The original implementation inserted excessive intermediate tensor copies for safety. ggerganov kept only the truly necessary copies and removed the rest.
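To make the routing/combine step concrete, here is a minimal NumPy sketch of a MoE forward pass: a router scores the experts for each token, the top-k are evaluated, and their outputs are blended with softmax-normalized weights. This is an illustrative simplification, not the ggml implementation, and all names (`moe_forward`, `router_w`, `experts`) are hypothetical.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Simplified MoE block: route each token to its top_k experts and
    combine their outputs with softmax-normalized router scores.
    (Conceptual sketch only -- not the ggml compute graph.)"""
    logits = x @ router_w                            # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax over selected experts
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])        # each expert is a weight matrix here
    return out

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 4, 8, 3
x = rng.standard_normal((n_tokens, d))
router_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, router_w, experts)
print(y.shape)  # (4, 8)
```

Every intermediate between the router and the combine step is a tensor in the compute graph; each avoidable copy of one of them is pure overhead, which is what the PR targets.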
```mermaid
graph LR
    A[Input Tensor] --> B[Router]
    B --> C[Expert 1]
    B --> D[Expert 2]
    B --> E[Expert 3]
    C --> F[Output Merge]
    D --> F
    E --> F
    F --> G[Next Layer]
    style A fill:#00B4D8,color:#fff
    style F fill:#FFB703,color:#000
    style G fill:#00B4D8,color:#fff
```
Benchmark Results
M2 Ultra Performance
Benchmarks for the Qwen3 Coder Next 80B-A3B model across different quantization levels.
Q4_0 Quantization
| Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|---|---|---|---|
| pp1 (single token) | 37.92 | 51.99 | 1.37x |
| pp8 (8-token batch) | 137.75 | 176.36 | 1.28x |
| pp512 (prompt) | 930.70 | 1125.73 | 1.21x |
| pp2048 (long prompt) | 1049.91 | 1352.31 | 1.29x |
| tg32 (generation) | 38.02 | 50.39 | 1.33x |
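The Speedup column is simply the ratio of optimized to baseline throughput, which you can verify against the table:

```python
def speedup(baseline_tps, optimized_tps):
    """Speedup factor from two tokens/sec throughput figures."""
    return optimized_tps / baseline_tps

# Q4_0 figures from the table above (M2 Ultra)
print(round(speedup(37.92, 51.99), 2))  # pp1  -> 1.37
print(round(speedup(38.02, 50.39), 2))  # tg32 -> 1.33
```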
Q4_K_M Quantization
| Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|---|---|---|---|
| pp1 | 34.00 | 46.47 | 1.37x |
| pp2048 | 977.30 | 1232.47 | 1.26x |
| tg32 | 34.63 | 46.43 | 1.34x |
Q8_0 Quantization
| Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|---|---|---|---|
| pp1 | 34.38 | 43.98 | 1.28x |
| pp2048 | 1047.39 | 1338.82 | 1.28x |
| tg32 | 33.75 | 43.78 | 1.30x |
DGX Spark Performance
Significant improvements are also observed on the NVIDIA DGX Spark.
| Quant | Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|---|---|---|---|---|
| Q4_0 | pp512 | 1055.58 | 1161.67 | 1.10x |
| Q4_0 | pp2048 | 1059.00 | 1324.66 | 1.25x |
| Q4_0 | tg32 | 43.11 | 59.58 | 1.38x |
| Q8_0 | pp2048 | 1009.43 | 1246.61 | 1.23x |
| Q8_0 | tg32 | 31.13 | 39.68 | 1.27x |
Notably, the DGX Spark achieves a 38% speedup in tg32 (token generation) with Q4_0 quantization.
Technical Background: Related Backend Optimizations
This PR doesn’t exist in isolation. For graph optimization to be effective, each backend (Metal, CUDA, Vulkan) needs to handle non-contiguous tensors directly.
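"Non-contiguous" here means a tensor whose elements are not laid out in one dense run of memory, typically a view (transpose, permute, slice) over another tensor. The idea is easiest to see in NumPy, whose strided views are analogous to ggml views; this sketch only illustrates the concept, not any ggml API:

```python
import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
t = a.T  # a transpose is a view: same data, permuted strides

print(a.flags['C_CONTIGUOUS'])  # True
print(t.flags['C_CONTIGUOUS'])  # False -- rows stride through memory

# The "safe" approach: materialize a contiguous buffer before the next op.
t_copy = np.ascontiguousarray(t)
print(t_copy.flags['C_CONTIGUOUS'])  # True, but it cost a full copy

# A backend kernel that reads strided inputs directly can skip that copy:
result = t @ a  # operates on the view without materializing t_copy
print(result.shape)  # (4, 4)
```

When every backend kernel can consume strided views directly, the graph no longer needs the defensive copy nodes that this PR removes.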
Metal (Apple Silicon)
- Adaptive CPU/GPU interleave (#19369): Dynamic workload distribution based on node count
- Binary kernel consolidation (#19390): Duplicate kernel code removal
- Unary ops consolidation (#19490): Improved unary operation handling
- Non-contiguous L2 norm support (#19502)
- Concurrency improvements (#19555)
CUDA (NVIDIA GPU)
- Non-contiguous tensor PAD extension (#19429)
- CUDA graphs enabled (#19521): For Qwen3 Next-style architectures
- Prevent cgraph mutation for fused ADDs (#19566)
Vulkan
Caveat: BF16 Tensor Issue
Some GGUF files may incorrectly contain 1D BF16 tensors. These can hurt performance on backends like Metal. #19606 fixes this by storing ffn_gate_inp_shexp tensors as F32.
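Why is widening BF16 to F32 cheap? bfloat16 is just the top 16 bits of an IEEE 754 float32 (same sign bit, same 8-bit exponent, 7 mantissa bits), so the conversion is a 16-bit shift. A bit-level sketch of the idea (a simplified, truncating converter, not the code from the PR):

```python
import struct

def bf16_to_f32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to float32: bf16 keeps the sign,
    the full 8-bit exponent, and the top 7 mantissa bits of f32,
    so widening is a left shift by 16."""
    return struct.unpack('<f', struct.pack('<I', bits16 << 16))[0]

def f32_to_bf16(value: float) -> int:
    """Truncate a float32 to bfloat16 (round-toward-zero for simplicity;
    production converters usually round to nearest even)."""
    return struct.unpack('<I', struct.pack('<f', value))[0] >> 16

print(bf16_to_f32(0x3F80))  # 1.0
print(bf16_to_f32(0x4000))  # 2.0
x = 3.140625  # exactly representable in bf16
print(bf16_to_f32(f32_to_bf16(x)))  # 3.140625
```

Since the affected `ffn_gate_inp_shexp` tensors are tiny 1D vectors, storing them as F32 doubles their size by a negligible amount while letting backends skip a slow BF16 path.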
What’s Next
ggerganov has outlined further optimizations:
- Qwen3 family code deduplication (#19597): Sharing delta-net graphs
- `ggml_build_forward_select()` utilization: Making the graph constant for further optimization
- Dedicated delta-net ggml op (#19504): More efficient kernel execution
Impact for Local LLM Users
Here’s what this optimization means in practice:
- Apple Silicon users: Run the 80B MoE model at ~50 t/s on M2 Ultra for tg32. That’s more than enough for real-time conversation.
- NVIDIA GPU users: 20–38% speedup on DGX Spark. CUDA graph support promises further improvements.
- Quantization choice: Q4_0 shows the largest gains, but Q4_K_M and Q8_0 also deliver consistent 20–37% improvements.
- No code changes needed: Simply update llama.cpp to the latest version.
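To put the generation numbers in perspective, per-token latency is just the inverse of throughput:

```python
def ms_per_token(tokens_per_sec: float) -> float:
    """Per-token generation latency implied by a throughput figure."""
    return 1000.0 / tokens_per_sec

# tg32 on M2 Ultra, Q4_0 (figures from the tables above)
print(round(ms_per_token(50.39), 1))  # optimized: ~19.8 ms per token
print(round(ms_per_token(38.02), 1))  # baseline:  ~26.3 ms per token
```

At roughly 20 ms per token, generation comfortably outpaces reading speed, which is why ~50 t/s feels real-time in conversation.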
Conclusion
ggerganov’s graph-level optimization significantly improves MoE model inference performance in llama.cpp. Rather than optimizing individual kernels, the approach of restructuring the compute graph itself is impressive. Combined with parallel efforts to expand non-contiguous tensor support across multiple backends (Metal, CUDA, Vulkan), this pushes the boundaries of local LLM inference performance.
If you’re running MoE models like Qwen3 Coder Next locally, update llama.cpp to the latest version and experience the speedup immediately.