# MiniMax M2.5: The Performance Gap Between Open-Weight and Proprietary Models Hits an All-Time Low
MiniMax M2.5 achieves 80.2% on SWE-Bench Verified, surpassing Claude Opus 4.6. We analyze how the performance gap between open-weight and proprietary models is rapidly closing, with comprehensive benchmark data.
## The Open-Weight Counterattack Has Begun
In February 2026, a shockwave hit the AI industry. MiniMax M2.5, released by the Chinese AI startup MiniMax, scored higher than proprietary models across multiple benchmarks including coding, agentic tasks, and search.
The announcement thread gathered 362 points on Reddit's r/LocalLLaMA, sparking active discussion about open-weight models finally catching up to closed ones. In this article, we analyze M2.5's specific performance data and the shifting landscape of open vs. closed models.
## MiniMax M2.5 Key Specifications
MiniMax M2.5 is a 229B-parameter open-weight model, freely available on HuggingFace.
- Parameters: 229B (MoE architecture)
- Training: Reinforcement learning across 200,000+ real-world environments
- Inference Speed: 100 tokens/second (Lightning version)
- Languages: Go, C, C++, TypeScript, Rust, Python, Java, and 10+ more
- Deployment: SGLang, vLLM, Transformers, KTransformers supported
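As a minimal self-hosting sketch using one of the supported backends (the model ID and flags here are assumptions for illustration, not taken from MiniMax's documentation), launching vLLM's OpenAI-compatible server might look like:

```shell
# Hypothetical model ID -- check MiniMax's HuggingFace page for the real one.
# An MoE model of this size typically needs multi-GPU tensor parallelism.
vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

Any OpenAI-compatible client can then point at the local endpoint instead of a proprietary API.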
## Benchmark Comparison: The Gap With Closed Models Approaches Zero

### SWE-Bench Verified (Coding)
SWE-Bench Verified measures the ability to resolve real GitHub issues.
| Model | Score | Type |
|---|---|---|
| MiniMax M2.5 | 80.2% | Open-weight |
| Claude Opus 4.6 | — | Proprietary |
| MiniMax M2.1 | — | Open-weight |
Results across different agent harnesses are particularly noteworthy:
- Droid harness: M2.5 (79.7%) > Opus 4.6 (78.9%)
- OpenCode harness: M2.5 (76.1%) > Opus 4.6 (75.9%)
In both environments, the open-weight model edged out the proprietary model — a historic result.
### Multi-SWE-Bench (Multi-Repository)
M2.5 achieved 51.3% on tasks spanning multiple repositories, demonstrating strong performance in complex real-world scenarios.
### BrowseComp (Search & Tool Use)
On BrowseComp, which measures web search and tool-calling abilities, M2.5 scored 76.3% (with context management), reaching industry-leading levels.
## The Cost Revolution: Dominance in Price, Not Just Performance
The impact of M2.5 extends beyond performance. The cost-performance ratio is in a different league.
| Metric | M2.5 Lightning | M2.5 Standard |
|---|---|---|
| Input Price | $0.3/M tokens | $0.15/M tokens |
| Output Price | $2.4/M tokens | $1.2/M tokens |
| Inference Speed | 100 TPS | 50 TPS |
| 1-hour Continuous Cost | $1.0 | $0.3 |
Compared with Claude Opus, Gemini 3 Pro, and GPT-5, M2.5's output tokens cost roughly one-tenth to one-twentieth as much.
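To make the pricing concrete, here is a small per-request cost calculator using the prices from the table above (the request sizes are illustrative):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request, with prices in $ per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 50K-token-input / 10K-token-output agentic coding request.
standard = request_cost(50_000, 10_000, input_price=0.15, output_price=1.2)
lightning = request_cost(50_000, 10_000, input_price=0.3, output_price=2.4)

print(f"Standard:  ${standard:.4f}")   # -> Standard:  $0.0195
print(f"Lightning: ${lightning:.4f}")  # -> Lightning: $0.0390
```

At these prices, even long agentic sessions stay in the cents range, which is what makes the "1-hour continuous cost" row in the table plausible.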
## Why M2.5 Evolved So Rapidly

### Massive RL Scaling
MiniMax developed an in-house agent-native RL framework called Forge.
```mermaid
graph TD
    A[Forge RL Framework] --> B[200K+ Real Environments]
    A --> C[CISPO Algorithm]
    A --> D[Process Reward Mechanism]
    B --> E[Coding Envs]
    B --> F[Search Envs]
    B --> G[Office Work Envs]
    C --> H[Stable MoE Training]
    D --> I[Long-Context Quality Monitoring]
    E & F & G --> J[M2.5]
    H & I --> J
```
Key technical highlights:
- Async scheduling optimization: Balancing system throughput against sample off-policyness
- Tree-structured merge strategy: ~40x training speedup for sample combining
- CISPO algorithm: Ensuring MoE model stability during large-scale training
- Process rewards: Addressing credit assignment in long-context agent rollouts
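As a rough sketch of the CISPO idea, which clips the importance-sampling weight itself rather than zeroing out-of-range tokens as PPO-style clipping does (the epsilon values below are illustrative, not MiniMax's actual settings):

```python
import numpy as np

def cispo_token_weights(logp_new, logp_old, eps_low=0.0, eps_high=0.2):
    """Clipped importance-sampling weights, one per token.

    PPO's clipped objective drops the gradient for tokens whose ratio
    leaves the trust region; CISPO instead clips the IS weight (treated
    as a constant in the full algorithm) so every token keeps a bounded
    gradient signal -- useful for stability in large-scale MoE training.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)

# A token whose ratio exploded to ~3.0 still contributes, with weight 1.2.
w = cispo_token_weights(logp_new=[0.0, 1.1], logp_old=[0.0, 0.0])
print(w)  # weights clipped into [1.0, 1.2]
```

The point of the sketch is the shape of the mechanism: bounded per-token weights instead of dropped tokens.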
### Emergent Spec-Writing Ability
A remarkable aspect of M2.5 is that the ability to design and plan like an architect before writing code emerged naturally during training. The model actively decomposes and plans project features, structure, and UI design before coding.
## The Shifting Open vs. Closed Landscape

### A Historic Turning Point
Until now, the AI industry operated under an implicit assumption: “the best-performing models are always proprietary.” M2.5 is changing that.
```mermaid
graph LR
    subgraph 2024
        A[Closed<br/>Dominant] --> B[Open<br/>Far Behind]
    end
    subgraph Late 2025
        C[Closed<br/>Slight Edge] --> D[Open<br/>Catching Up]
    end
    subgraph Early 2026
        E[Closed<br/>On Par] --- F[Open<br/>Surpassing in Areas]
    end
```
### What This Means for Enterprises
- Avoiding Vendor Lock-in: If open-weight models deliver frontier performance, dependency on specific API vendors can be reduced
- Customization Freedom: Fine-tuning with proprietary data and domain specialization becomes possible
- Cost Optimization: Self-hosting for cost control; even M2.5’s API is 1/10th to 1/20th the cost
- Data Privacy: No need to send sensitive data to external providers
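A back-of-the-envelope way to weigh self-hosting against the API (the GPU rental price is an illustrative assumption, and the sketch ignores input-token cost):

```python
def breakeven_tokens_per_hour(gpu_cost_per_hour: float,
                              api_output_price_per_m: float) -> float:
    """Output tokens/hour at which self-hosting matches API output cost."""
    return gpu_cost_per_hour / api_output_price_per_m * 1_000_000

# Illustrative: an 8-GPU node at $20/hour vs M2.5 Standard output at $1.2/M tokens.
tokens = breakeven_tokens_per_hour(20.0, 1.2)
print(f"{tokens:,.0f} output tokens/hour to break even")
```

Below that sustained throughput the API is cheaper; above it, or when privacy and customization dominate, self-hosting wins.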
## The Rapid Evolution of the M2 Series
In just 3.5 months (late October 2025 to February 2026), MiniMax released three generations: M2, M2.1, and M2.5.
| Version | Release | SWE-Bench Improvement | Notable |
|---|---|---|---|
| M2 | Late Oct 2025 | Baseline | 450K HuggingFace downloads |
| M2.1 | Dec 2025 | Major improvement | 86.7K downloads |
| M2.5 | Feb 2026 | 80.2% SOTA | 37% faster, 1/10 cost |
### Internal Production Adoption
MiniMax actively uses M2.5 within their own organization:
- 30% of company-wide tasks autonomously completed by M2.5
- Spanning R&D, product, sales, HR, and finance
- 80% of newly committed code generated by M2.5
## Conclusion: Three Key Takeaways
1. The Performance Gap Has Vanished: An open-weight model has surpassed closed models on SWE-Bench. This is not a fluke; it is the beginning of a structural shift.
2. Cost Revolution: M2.5 delivers equal or better performance at 1/10th to 1/20th the cost of Opus. The "frontier model you don't have to worry about cost for" is now real.
3. Expanding Choices: Enterprises no longer need to default to proprietary models. Self-hosting, customization, and cost optimization through open-weight models are practical options.
For AI developers, 2026 may mark the dawn of a golden age for open-weight models.