How to Run Qwen3-Coder-Next 80B on 8GB VRAM — Quantization Techniques Explained
Analyzing quantization and lazy loading techniques to run an 80B parameter coding AI model on consumer 8GB VRAM GPUs. Exploring the practicality and limitations of local LLM coding.
Overview
What if you could run an 80B parameter coding-specialized model on a laptop GPU with just 8GB VRAM? A developer named nalexand from Reddit’s r/LocalLLaMA community released a project that makes this possible. They successfully ran Qwen3-Coder-Next 80B on an RTX 3070Ti (8GB VRAM) at 1.2 tokens/s.
This article analyzes the project’s core technologies — FP8 quantization, expert lazy loading, and cache optimization strategies — and examines the practical implications and limitations of running large LLMs on consumer GPUs.
The Core Challenge: Fitting 80B into 8GB
Why It Seems Impossible
Qwen3-Coder-Next is an 80B parameter model. Even with FP8 quantization, the model size reaches approximately 80GB. In an 8GB VRAM + 32GB RAM environment, loading the entire model into memory is simply impossible.
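The arithmetic makes the gap concrete; a minimal sanity check, assuming roughly one byte per parameter at FP8:

```python
# Back-of-the-envelope memory budget, assuming ~1 byte per parameter at FP8.
params = 80e9                        # 80B parameters
model_gb = params * 1 / 1e9          # FP8 ≈ 1 byte/param → ~80 GB
vram_gb, ram_gb = 8, 32
print(f"Model ~{model_gb:.0f} GB vs. {vram_gb + ram_gb} GB of VRAM + RAM combined")
# Model ~80 GB vs. 40 GB of VRAM + RAM combined
```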
First Attempt: Disk Offloading
The developer first tried disk offloading with `device_map="auto"` via Hugging Face’s accelerate library. The results were dismal:
- Speed: 1 token / 255 seconds
- Practically unusable
This was a textbook case of disk I/O bottlenecks destroying inference performance.
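For reference, the naive baseline looks roughly like the sketch below. It uses the standard transformers/accelerate offloading path; the offload folder and dtype arguments are illustrative assumptions, not details taken from the project.

```python
# Naive baseline: let accelerate spill whatever does not fit onto CPU RAM and disk.
# The offload folder and dtype arguments are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Coder-Next-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # accelerate decides GPU / CPU / disk placement
    offload_folder="offload",    # weights that fit nowhere else are served from disk
    torch_dtype="auto",
)
# Every forward pass has to stream the offloaded layers back in,
# which is why this path measured roughly 255 seconds per token.
```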
The Solution: Expert Lazy Loading + Cache Optimization
Leveraging MoE Architecture Characteristics
The key insight came from analyzing the model structure. Most large tensors in the 80B model are concentrated in MLP experts, while the remaining components fit within approximately 4.6GB — well within VRAM capacity.
```mermaid
graph LR
    A[80B Model<br/>~80GB FP8] --> B[Non-Expert Layers<br/>~4.6GB]
    A --> C[MLP Experts<br/>~75.4GB]
    B --> D[Resident in VRAM]
    C --> E[Lazy Loaded from<br/>SSD/RAM]
```
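This split can be verified directly from the parameter names in the checkpoint. A minimal sketch, assuming Qwen3-style MoE naming where expert weights contain "mlp.experts" in their path:

```python
# Sketch: group parameter sizes to see where the ~80GB actually lives.
# The "mlp.experts" substring is an assumption based on Qwen-style MoE naming.
from collections import defaultdict

def size_by_group(model):
    sizes_gb = defaultdict(float)
    for name, param in model.named_parameters():
        group = "experts" if "mlp.experts" in name else "non_expert"
        sizes_gb[group] += param.numel() * param.element_size() / 1e9
    return dict(sizes_gb)

# Expected shape of the result for this model:
# {"experts": ~75.4, "non_expert": ~4.6}
```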
Custom Lazy Loading System
The developer built a custom lazy loading system for MLP experts:
- 2-tier cache: VRAM cache + Pinned RAM cache
- Cache hit rate: Up to 85%
- Speed improvement: 255s/token → 1.2 tokens/s (~300x speedup)
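The post does not include the cache implementation itself, but the two-tier design described above can be sketched roughly as a VRAM LRU backed by a pinned-RAM LRU, with an SSD read only on a double miss. Class and method names below are illustrative, not the project’s actual code.

```python
# Illustrative two-tier expert cache: VRAM LRU backed by a pinned-RAM LRU,
# falling back to an SSD read on a double miss. Not the project's actual code.
from collections import OrderedDict
import torch

class ExpertCache:
    def __init__(self, max_gpu=18, max_ram=100, device="cuda"):
        self.gpu = OrderedDict()   # expert_id -> tensor resident on the GPU
        self.ram = OrderedDict()   # expert_id -> pinned CPU tensor
        self.max_gpu, self.max_ram, self.device = max_gpu, max_ram, device

    def get(self, expert_id, load_from_disk):
        if expert_id in self.gpu:                      # tier 1 hit
            self.gpu.move_to_end(expert_id)
            return self.gpu[expert_id]
        if expert_id in self.ram:                      # tier 2 hit: fast H2D copy
            self.ram.move_to_end(expert_id)
            cpu_tensor = self.ram[expert_id]
        else:                                          # miss: read from SSD, pin it
            cpu_tensor = load_from_disk(expert_id).pin_memory()
            self.ram[expert_id] = cpu_tensor
            if len(self.ram) > self.max_ram:
                self.ram.popitem(last=False)           # evict least-recently used
        gpu_tensor = cpu_tensor.to(self.device, non_blocking=True)
        self.gpu[expert_id] = gpu_tensor
        if len(self.gpu) > self.max_gpu:
            self.gpu.popitem(last=False)
        return gpu_tensor
```

Pinned (page-locked) host memory is what makes the non-blocking host-to-GPU copy effective; with ordinary pageable memory the transfer would have to go through an extra staging copy.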
Cache Parameter Tuning
```python
# VRAM cache size (18 cached units ≈ ~3GB)
self.max_gpu_cache = 18

# RAM cache size (based on pinnable memory)
self.max_ram_cache = 100
```
| GPU | Recommended max_gpu_cache | Expected Cache Hit Rate |
|---|---|---|
| RTX 3070Ti (8GB) | 18 | ~85% |
| RTX 5090 (32GB) | 120 | >85% |
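The recommended values above can also be approximated from free VRAM at runtime. A rough helper, assuming the ~3GB-per-18-units footprint quoted in the code comment above:

```python
import torch

def suggest_gpu_cache(per_unit_gb=3 / 18, reserve_gb=0.5):
    """Rough guess for max_gpu_cache from the VRAM still free on the current GPU."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    usable_gb = free_bytes / 1e9 - reserve_gb     # keep headroom for activations
    return max(1, int(usable_gb / per_unit_gb))

# On an 8GB card with the ~4.6GB of non-expert layers already resident,
# this lands in the same ballpark as the recommended value of 18.
```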
Tech Stack and Installation
Requirements
- Model: `Qwen/Qwen3-Coder-Next-FP8` (download from Hugging Face)
- GPU: 8GB+ VRAM
- RAM: 32GB+ (pinnable memory is typically 1/2 of total RAM; a quick check is sketched after this list)
- Storage: Fast NVMe SSD recommended (PCIe 5.0 RAID 0 up to 30GB/s)
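How much RAM can actually be pinned varies by OS and driver settings. A quick empirical check; the chunk size and upper bound below are arbitrary illustration values:

```python
# Probe how much host RAM can actually be page-locked (pinned) on this machine.
# Chunk size and loop bound are arbitrary illustration values.
import torch

def max_pinnable_gb(step_gb=1, limit_gb=32):
    pinned, total = [], 0
    try:
        for _ in range(int(limit_gb // step_gb)):
            pinned.append(torch.empty(int(step_gb * 1e9), dtype=torch.uint8, pin_memory=True))
            total += step_gb
    except RuntimeError:   # allocation fails once the OS refuses to lock more pages
        pass
    del pinned
    return total

print(f"~{max_pinnable_gb()} GB of RAM can be pinned on this machine")
```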
Installation Steps
```bash
# 1. Download the model
hf download Qwen/Qwen3-Coder-Next-FP8

# 2. Replace the modeling file in the transformers library
#    (transformers/models/qwen3_next/modeling_qwen3_next.py)

# 3. Extract MLP experts
python extract_mlp.py

# 4. Run the chatbot
python coder_80b_next_chat.py
```
Real-World Performance Benchmarks
Here are the cache warmup test results shared by the developer:
| Prompt | Tokens | Time | Speed |
|---|---|---|---|
| First “hi” | 11 | 21.25s | 0.52 t/s |
| Second “hi” | 26 | 25.36s | 1.03 t/s |
| “all good” | 50 | 41.70s | 1.20 t/s |
| Long response (807 tokens) | 807 | 668.81s | 1.21 t/s |
After cache warmup, the system consistently maintains ~1.2 t/s. The first request is slower due to cache misses, but subsequent requests benefit from higher cache hit rates.
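To reproduce the warmup behaviour on your own hardware, a simple timing loop around the generation call is enough. The generate_reply function below stands in for whatever entry point the project’s chat script exposes; it is an assumption, not the project’s API.

```python
# Measure tokens/s across consecutive prompts to observe cache warmup.
# generate_reply() is a placeholder for the project's actual generation call.
import time

def benchmark(prompts, generate_reply):
    for prompt in prompts:
        start = time.perf_counter()
        output_tokens = generate_reply(prompt)   # assumed to return generated token ids
        elapsed = time.perf_counter() - start
        print(f"{prompt!r}: {len(output_tokens)} tokens in {elapsed:.2f}s "
              f"({len(output_tokens) / elapsed:.2f} t/s)")

# benchmark(["hi", "hi", "all good"], generate_reply)
```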
Practicality and Limitations
Advantages
- Cost: Run an 80B coding model locally without cloud API costs
- Privacy: Code never leaves your machine
- Offline: Works without internet connection
Limitations
- Speed: 1.2 t/s is insufficient for real-time coding assistance (Claude and GPT APIs deliver 30-80 t/s)
- Initial latency: Cache warmup takes time
- Installation complexity: Requires manual modification of transformers library files
- Memory requirements: 32GB RAM is still needed
Future Outlook
| GPU | VRAM | Expected Speed |
|---|---|---|
| RTX 3070Ti | 8GB | ~1.2 t/s (confirmed) |
| RTX 4090 | 24GB | 5-10 t/s (estimated) |
| RTX 5090 | 32GB | 20+ t/s (developer’s estimate) |
With the RTX 5090’s 32GB VRAM and high memory bandwidth, setting max_gpu_cache=120 could potentially achieve 20+ t/s.
The Frontier of Local LLM Coding
This project embodies the local LLM community’s spirit of making the impossible possible. Developer nalexand has previously optimized various large models for low-spec GPUs, including LTX-2, Wan2.2, HeartMula, and ACE-Step 1.5.
Key takeaways:
- Model structure analysis is the starting point for optimization: Understanding expert distribution in MoE models enables selective loading
- Multi-tier caching is essential: A VRAM → Pinned RAM → SSD caching strategy achieved a 300x speedup
- Hardware evolution narrows the gap: Next-generation GPUs may reach practical speeds
Conclusion
Running Qwen3-Coder-Next 80B on 8GB VRAM is a technically impressive achievement. While the current 1.2 t/s speed is insufficient for real-time coding assistance, advances in next-generation GPUs and optimization techniques are bringing large coding model execution on consumer hardware increasingly closer to reality.
Developers interested in local LLMs should check out nalexand’s GitHub repository and experiment with their own hardware.