Qwen3 Coder Next llama.cpp Graph Optimization — Up to 38% Inference Speedup

ggerganov restructures the llama.cpp compute graph to achieve up to 38% inference speedup for the Qwen3 Coder Next 80B model. Detailed benchmark analysis and technical breakdown.

Overview

llama.cpp core developer ggerganov has published PR #19375, which optimizes the compute graph for the Qwen3 Coder Next model. By eliminating unnecessary tensor copy operations and restructuring inference paths at the graph level, the change achieves up to 37% speedup on the M2 Ultra and up to 38% on the DGX Spark. With 177+ points on Reddit's r/LocalLLaMA, let's examine the core of this optimization.

Core Idea: Graph-Level Optimization

The core idea is straightforward: remove unnecessary tensor copy operations from the ggml compute graph.

Qwen3 Coder Next uses a MoE (Mixture of Experts) architecture where routers select which experts to activate and combine their outputs. The original implementation inserted excessive intermediate tensor copies for safety. ggerganov kept only the truly necessary copies and removed the rest.
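To illustrate the idea, here is a minimal, self-contained sketch of dead-copy elimination on a toy compute graph. This is not the actual ggml code — node names, the `Node` class, and `eliminate_copies` are invented for illustration; the real pass works on ggml tensors and is more careful about which copies are load-bearing.

```python
# Toy illustration of graph-level copy elimination: a compute graph is a list
# of nodes; COPY nodes can be bypassed by rewiring their consumers to the
# copy's source, then dropped from the graph.

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                           # e.g. "INPUT", "COPY", "MUL_MAT"
    inputs: list = field(default_factory=list)

def eliminate_copies(graph):
    """Rewire consumers of COPY nodes to the copy's source, then drop the copies."""
    replacement = {}
    for node in graph:                # graph is assumed to be in topological order
        if node.op == "COPY":
            src = node.inputs[0]
            # follow chains of copies back to the original tensor
            replacement[id(node)] = replacement.get(id(src), src)
    for node in graph:
        node.inputs = [replacement.get(id(x), x) for x in node.inputs]
    return [n for n in graph if n.op != "COPY"]

# usage: router output -> redundant copy -> expert matmul; the copy is removed
x = Node("INPUT")
c = Node("COPY", [x])
mm = Node("MUL_MAT", [c])
optimized = eliminate_copies([x, c, mm])
```

The catch, and the reason the original code copied defensively, is that once the copy is gone the consumer may receive a non-contiguous tensor view — which is why the backend PRs listed below matter.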

```mermaid
graph LR
    A[Input Tensor] --> B[Router]
    B --> C[Expert 1]
    B --> D[Expert 2]
    B --> E[Expert 3]
    C --> F[Output Merge]
    D --> F
    E --> F
    F --> G[Next Layer]

    style A fill:#00B4D8,color:#fff
    style F fill:#FFB703,color:#000
    style G fill:#00B4D8,color:#fff
```

Benchmark Results

M2 Ultra Performance

Benchmarks for the Qwen3 Coder Next 80B-A3B model across different quantization levels.

Q4_0 Quantization

| Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|------|----------------|-----------------|---------|
| pp1 (single token) | 37.92 | 51.99 | 1.37x |
| pp8 (8-token batch) | 137.75 | 176.36 | 1.28x |
| pp512 (prompt) | 930.70 | 1125.73 | 1.21x |
| pp2048 (long prompt) | 1049.91 | 1352.31 | 1.29x |
| tg32 (generation) | 38.02 | 50.39 | 1.33x |

Q4_K_M Quantization

| Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|------|----------------|-----------------|---------|
| pp1 | 34.00 | 46.47 | 1.37x |
| pp2048 | 977.30 | 1232.47 | 1.26x |
| tg32 | 34.63 | 46.43 | 1.34x |

Q8_0 Quantization

| Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|------|----------------|-----------------|---------|
| pp1 | 34.38 | 43.98 | 1.28x |
| pp2048 | 1047.39 | 1338.82 | 1.28x |
| tg32 | 33.75 | 43.78 | 1.30x |

DGX Spark Performance

Significant improvements are also observed on the NVIDIA DGX Spark.

| Quant | Test | Baseline (t/s) | Optimized (t/s) | Speedup |
|-------|------|----------------|-----------------|---------|
| Q4_0 | pp512 | 1055.58 | 1161.67 | 1.10x |
| Q4_0 | pp2048 | 1059.00 | 1324.66 | 1.25x |
| Q4_0 | tg32 | 43.11 | 59.58 | 1.38x |
| Q8_0 | pp2048 | 1009.43 | 1246.61 | 1.23x |
| Q8_0 | tg32 | 31.13 | 39.68 | 1.27x |

Notably, the DGX Spark achieves a 38% speedup in tg32 (token generation) with Q4_0 quantization.
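The speedup column in the tables above is simply the ratio of optimized to baseline throughput, which can be checked against the raw tokens-per-second figures:

```python
# Speedup factor = optimized throughput / baseline throughput, in tokens/s.
def speedup(baseline_tps, optimized_tps):
    return optimized_tps / baseline_tps

# DGX Spark, Q4_0, tg32 (numbers from the table above)
print(round(speedup(43.11, 59.58), 2))  # → 1.38
```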

Parallel Backend Work

This PR doesn’t exist in isolation. For graph optimization to be effective, each backend (Metal, CUDA, Vulkan) needs to handle non-contiguous tensors directly.

Metal (Apple Silicon)

  • Adaptive CPU/GPU interleave (#19369): Dynamic workload distribution based on node count
  • Binary kernel consolidation (#19390): Duplicate kernel code removal
  • Unary ops consolidation (#19490): Improved unary operation handling
  • Non-contiguous L2 norm support (#19502)
  • Concurrency improvements (#19555)

CUDA (NVIDIA GPU)

  • Non-contiguous tensor PAD extension (#19429)
  • CUDA graphs enabled (#19521): For Qwen3 Next-style architectures
  • Prevent cgraph mutation for fused ADDs (#19566)

Vulkan

  • L2_NORM contiguous row support (#19604)
  • GGML_OP_SET support (#19584)

Caveat: BF16 Tensor Issue

Some GGUF files may incorrectly contain 1D BF16 tensors. These can hurt performance on backends like Metal. #19606 fixes this by storing ffn_gate_inp_shexp tensors as F32.
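Why is F32 a safe replacement? BF16 is just the top 16 bits of an IEEE-754 float32 (same sign and exponent layout), so widening it to F32 is a lossless 16-bit shift. A small sketch of that bit-level relationship (illustrative only, not the PR's code):

```python
import struct

def bf16_to_f32(bits: int) -> float:
    """Widen a bfloat16 bit pattern to float32: bf16 is the high 16 bits of f32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

# 0x3F80 is 1.0 in bfloat16 (same sign/exponent layout as float32's 0x3F800000)
print(bf16_to_f32(0x3F80))  # → 1.0
```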

What’s Next

ggerganov has outlined further optimizations:

  1. Qwen3 family code deduplication (#19597): Sharing delta-net graphs
  2. ggml_build_forward_select() utilization: Making the graph constant for further optimization
  3. Dedicated delta net ggml op (#19504): More efficient kernel execution

Impact for Local LLM Users

Here’s what this optimization means in practice:

  • Apple Silicon users: Run the 80B MoE model at ~50 t/s on M2 Ultra for tg32. That’s more than enough for real-time conversation.
  • NVIDIA GPU users: 20–38% speedup on DGX Spark. CUDA graph support promises further improvements.
  • Quantization choice: Q4_0 shows the largest gains, but Q4_K_M and Q8_0 also deliver consistent 20–37% improvements.
  • No code changes needed: Simply update llama.cpp to the latest version.
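To put the ~50 t/s figure in perspective, a back-of-envelope latency estimate (a trivial sketch; reply lengths are illustrative):

```python
# Wall-clock generation time for a reply of a given length at a given throughput.
def latency_seconds(num_tokens, tokens_per_second):
    return num_tokens / tokens_per_second

print(latency_seconds(500, 50.0))  # → 10.0 (a 500-token reply in ~10 seconds)
```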

Conclusion

ggerganov’s graph-level optimization significantly improves MoE model inference performance in llama.cpp. What makes it notable is the approach: rather than tuning individual kernels, it restructures the compute graph itself. Combined with parallel efforts to expand non-contiguous tensor support across the Metal, CUDA, and Vulkan backends, this pushes the boundaries of local LLM inference performance.

If you’re running MoE models like Qwen3 Coder Next locally, update llama.cpp to the latest version and experience the speedup immediately.


About the Author


Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.