MiniMax M2.5: The Performance Gap Between Open-Weight and Proprietary Models Hits an All-Time Low

MiniMax M2.5 achieves 80.2% on SWE-Bench Verified, surpassing Claude Opus 4.6. We analyze how the performance gap between open-weight and proprietary models is rapidly closing, with comprehensive benchmark data.

The Open-Weight Counterattack Has Begun

In February 2026, a shockwave hit the AI industry. MiniMax M2.5, released by the Chinese AI startup MiniMax, scored higher than proprietary models across multiple benchmarks including coding, agentic tasks, and search.

The release announcement gathered over 362 points on Reddit’s r/LocalLLaMA, sparking active discussion about open-weight models finally catching up to closed ones. In this article, we analyze M2.5’s concrete performance data and the shifting landscape of open vs. closed models.

MiniMax M2.5 Key Specifications

MiniMax M2.5 is a 229B parameter open-weight model freely available on HuggingFace.

  • Parameters: 229B (MoE architecture)
  • Training: Reinforcement learning across 200,000+ real-world environments
  • Inference Speed: 100 tokens/second (Lightning version)
  • Languages: Go, C, C++, TypeScript, Rust, Python, Java, and 10+ more
  • Deployment: SGLang, vLLM, Transformers, KTransformers supported
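Since both vLLM and SGLang expose an OpenAI-compatible HTTP API, querying a self-hosted M2.5 is just a standard chat-completions request. A minimal sketch below builds such a request payload; the model ID `MiniMaxAI/MiniMax-M2.5`, port, and parameter values are assumptions for illustration, not confirmed by MiniMax.

```python
# Sketch: building a /v1/chat/completions payload for a locally served M2.5.
# The model ID and endpoint below are assumptions for illustration.
import json

def build_chat_request(prompt: str,
                       model: str = "MiniMaxAI/MiniMax-M2.5",
                       max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible chat request for a local vLLM/SGLang server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature is common for coding tasks
    }

payload = build_chat_request("Fix the failing test in utils/date.py")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions once the server is up.
```

The same payload works unchanged against either serving stack, which is part of the lock-in argument discussed later in this article.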

Benchmark Comparison: The Gap With Closed Models Approaches Zero

SWE-Bench Verified (Coding)

SWE-Bench Verified measures the ability to resolve real GitHub issues.

| Model | Score | Type |
|---|---|---|
| MiniMax M2.5 | 80.2% | Open-weight |
| Claude Opus 4.6 | — | Proprietary |
| MiniMax M2.1 | — | Open-weight |

Results across different agent harnesses are particularly noteworthy:

  • Droid harness: M2.5 (79.7%) > Opus 4.6 (78.9%)
  • OpenCode harness: M2.5 (76.1%) > Opus 4.6 (75.9%)

In both environments, the open-weight model edged out the proprietary model — a historic result.

Multi-SWE-Bench (Multi-Repository)

M2.5 achieved 51.3% on tasks spanning multiple repositories, demonstrating strong performance in complex real-world scenarios.

BrowseComp (Search & Tool Use)

On BrowseComp, which measures web search and tool-calling abilities, M2.5 scored 76.3% (with context management), reaching industry-leading levels.

The Cost Revolution: Dominance in Price, Not Just Performance

The impact of M2.5 extends beyond performance. The cost-performance ratio is in a different league.

| Metric | M2.5 Lightning | M2.5 Standard |
|---|---|---|
| Input price | $0.3/M tokens | $0.15/M tokens |
| Output price | $2.4/M tokens | $1.2/M tokens |
| Inference speed | 100 TPS | 50 TPS |
| 1-hour continuous cost | $1.0 | $0.3 |

Compared to Claude Opus, Gemini 3 Pro, and GPT-5, M2.5’s output tokens cost roughly 1/10th to 1/20th as much.
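The pricing table above translates directly into per-request costs. A small sketch, using only M2.5’s own published rates (proprietary-model prices vary by tier and are not assumed here); the token counts in the example are a hypothetical agentic coding step:

```python
# Per-request cost estimate from M2.5's pricing table (USD per million tokens).
PRICING = {
    "m2.5-lightning": {"input": 0.30, "output": 2.40},
    "m2.5-standard":  {"input": 0.15, "output": 1.20},
}

def task_cost(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request for the given M2.5 variant."""
    p = PRICING[variant]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical agentic coding step: large context in, a modest patch out.
cost = task_cost("m2.5-standard", input_tokens=50_000, output_tokens=4_000)
print(f"${cost:.4f}")  # → $0.0123
```

At these rates, even a long agent session with hundreds of such steps stays in the single-dollar range, which is what the "1-hour continuous cost" row above reflects.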

Why M2.5 Evolved So Rapidly

Massive RL Scaling

MiniMax developed an in-house agent-native RL framework called Forge.

```mermaid
graph TD
    A[Forge RL Framework] --> B[200K+ Real Environments]
    A --> C[CISPO Algorithm]
    A --> D[Process Reward Mechanism]
    B --> E[Coding Envs]
    B --> F[Search Envs]
    B --> G[Office Work Envs]
    C --> H[Stable MoE Training]
    D --> I[Long-Context Quality Monitoring]
    E & F & G --> J[M2.5]
    H & I --> J
```

Key technical highlights:

  • Async scheduling optimization: Balancing system throughput against sample off-policyness
  • Tree-structured merge strategy: ~40x training speedup for sample combining
  • CISPO algorithm: Ensuring MoE model stability during large-scale training
  • Process rewards: Addressing credit assignment in long-context agent rollouts
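MiniMax has not published Forge’s internals, so the following is only a generic illustration of the tree-structured merge idea: combining N rollout buffers pairwise over O(log N) rounds instead of folding them in one at a time, so that each round’s pair-merges can run in parallel. Function names and the buffer representation are hypothetical.

```python
# Illustrative sketch only (Forge's actual implementation is unpublished):
# merging N rollout buffers pairwise, level by level, takes O(log N)
# sequential rounds; with the pairs of each round merged in parallel,
# this is where a large wall-clock speedup over sequential folding comes from.
from typing import List

def merge_pair(a: List[dict], b: List[dict]) -> List[dict]:
    """Combine two rollout buffers (here: simple concatenation)."""
    return a + b

def tree_merge(buffers: List[List[dict]]) -> List[dict]:
    """Merge buffers pairwise until one remains."""
    rounds = 0
    while len(buffers) > 1:
        nxt = [merge_pair(buffers[i], buffers[i + 1])
               for i in range(0, len(buffers) - 1, 2)]
        if len(buffers) % 2:          # carry an odd buffer into the next round
            nxt.append(buffers[-1])
        buffers = nxt
        rounds += 1
    print(f"merged in {rounds} rounds")  # 16 buffers -> 4 rounds
    return buffers[0]

samples = tree_merge([[{"env": i}] for i in range(16)])
```

Sequential folding of 16 buffers would need 15 dependent merge steps; the tree needs only 4 dependent rounds, which scales as log2(N) as the number of rollout workers grows.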

Emergent Spec-Writing Ability

Remarkably, M2.5’s ability to design and plan like an architect before writing code emerged naturally during training: the model actively decomposes a project’s features, structure, and UI design into a plan before it starts coding.

The Shifting Open vs. Closed Landscape

A Historic Turning Point

Until now, the AI industry operated under an implicit assumption: “the best-performing models are always proprietary.” M2.5 is changing that.

```mermaid
graph LR
    subgraph 2024
        A[Closed<br/>Dominant] --> B[Open<br/>Far Behind]
    end
    subgraph Late 2025
        C[Closed<br/>Slight Edge] --> D[Open<br/>Catching Up]
    end
    subgraph Early 2026
        E[Closed<br/>On Par] --- F[Open<br/>Surpassing in Areas]
    end
```

What This Means for Enterprises

  1. Avoiding Vendor Lock-in: If open-weight models deliver frontier performance, dependency on specific API vendors can be reduced
  2. Customization Freedom: Fine-tuning with proprietary data and domain specialization becomes possible
  3. Cost Optimization: Self-hosting for cost control; even M2.5’s API is 1/10th to 1/20th the cost
  4. Data Privacy: No need to send sensitive data to external providers

The Rapid Evolution of the M2 Series

In just 3.5 months (late October 2025 to February 2026), MiniMax released three generations: M2, M2.1, and M2.5.

| Version | Release | SWE-Bench Verified | Notable |
|---|---|---|---|
| M2 | Late Oct 2025 | Baseline | 450K HuggingFace downloads |
| M2.1 | Dec 2025 | Major improvement | 86.7K downloads |
| M2.5 | Feb 2026 | 80.2% (SOTA) | 37% faster, 1/10 cost |

Internal Production Adoption

MiniMax actively uses M2.5 within their own organization:

  • 30% of company-wide tasks autonomously completed by M2.5
  • Spanning R&D, product, sales, HR, and finance
  • 80% of newly committed code generated by M2.5

Conclusion: Three Key Takeaways

  1. The Performance Gap Has Vanished: An open-weight model has surpassed closed models on SWE-Bench. This is not a fluke — it’s the beginning of a structural shift

  2. Cost Revolution: M2.5 delivers equal or better performance at 1/10th to 1/20th the cost of Opus. The “frontier model you don’t have to worry about cost for” is now real

  3. Expanding Choices: Enterprises no longer need to default to proprietary models. Self-hosting, customization, and cost optimization through open-weight models are practical options

For AI developers, 2026 may mark the dawn of a golden age for open-weight models.

About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.