Qwen 3.5 Goes Bankrupt on Vending-Bench 2: The Pitfall of Benchmark Obsession

Qwen 3.5, a top performer on standard benchmarks, goes bankrupt on Vending-Bench 2's vending machine simulation. Exploring the blind spots of benchmark-driven AI evaluation.

Overview

Alibaba’s large language model Qwen 3.5 Plus consistently ranks at the top of standard benchmarks like MMLU, HumanEval, and MATH. However, on Vending-Bench 2, a non-standard benchmark developed by Andon Labs, the model delivered a shocking result: bankruptcy. This finding garnered over 595 upvotes on Reddit’s r/LocalLLaMA, sparking a broader discussion about how we evaluate AI models.

What Is Vending-Bench 2?

Vending-Bench 2 is a vending machine business simulation benchmark developed by Andon Labs. It tasks AI models with running a virtual vending machine business over approximately 365 days, comprehensively measuring financial management, decision-making, and strategic planning capabilities.

Unlike traditional benchmarks, it measures practical abilities such as:

  • Long-term strategic thinking: Continuous business decisions over a full year
  • Financial risk management: Balancing profitability with sustainability
  • Adaptability: Responding to changing simulation conditions
  • Applied reasoning: Not just knowledge, but the ability to apply it
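To make the setup concrete, here is a minimal sketch of what a Vending-Bench-style daily loop might look like. The real benchmark's rules, prices, and agent interface are not described in this article, so every number and function below is an illustrative assumption, not Andon Labs' actual implementation.

```python
# Hypothetical sketch of a year-long vending machine simulation.
# All parameters (starting cash, prices, fees, demand) are assumed.
from dataclasses import dataclass
import random

@dataclass
class VendingState:
    balance: float = 500.0  # assumed starting cash
    stock: int = 0          # units currently in the machine
    day: int = 0

def naive_agent(state: VendingState) -> int:
    """Toy policy: restock back up to 50 units every day."""
    return max(0, 50 - state.stock)

def simulate(agent, days: int = 365, unit_cost: float = 1.0,
             unit_price: float = 2.5, daily_fee: float = 2.0,
             seed: int = 0) -> VendingState:
    rng = random.Random(seed)
    state = VendingState()
    for day in range(1, days + 1):
        state.day = day
        # The agent decides how many units to order today,
        # capped by the cash it actually has.
        order = min(agent(state), int(state.balance // unit_cost))
        state.balance -= order * unit_cost
        state.stock += order
        # Random daily demand; sales are capped by available stock.
        sold = min(state.stock, rng.randint(0, 30))
        state.stock -= sold
        state.balance += sold * unit_price
        # Fixed operating fee: ignoring recurring costs bleeds cash.
        state.balance -= daily_fee
        if state.balance < 0:
            break  # bankrupt -- the run ends early
    return state
```

Running `simulate(naive_agent)` plays out a full year. An agent that over-orders stock or ignores the recurring fee can push the balance below zero long before day 365, which is exactly the failure mode this kind of benchmark surfaces.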

Shocking Results: Qwen 3.5 Finishes Last with Bankruptcy

Vending-Bench 2 Results — Money Balance Over Time (Source: Andon Labs / Reddit r/LocalLLaMA)

The chart above shows each model’s performance over the 365-day simulation:

Rank   Model             Final Balance (Approx.)
1st    GLM-5             ~$8,000+
2nd    Gemini 3 Flash    ~$4,000–$4,500
3rd    Kimi K2.5         ~$3,500–$4,000
4th    Claude Opus 4.6   ~$2,000–$2,500
5th    DeepSeek-V3.2     ~$200–$500
6th    Qwen 3.5 Plus     ~$0 (Bankrupt)

A model that ranks among the best on standard benchmarks finished dead last with zero balance — a truly stunning result.

Why Does This Discrepancy Exist?

The Limits of Standard Benchmarks

```mermaid
graph TD
    A[Standard Benchmarks] --> B[Knowledge Tests<br/>MMLU, ARC]
    A --> C[Coding<br/>HumanEval, MBPP]
    A --> D[Math<br/>MATH, GSM8K]
    A --> E[Reasoning<br/>BBH, HellaSwag]

    F[Vending-Bench 2] --> G[Long-term Strategy]
    F --> H[Financial Management]
    F --> I[Risk Assessment]
    F --> J[Adaptability]

    style A fill:#e8f5e9
    style F fill:#fff3e0
```

Standard benchmarks excel at measuring static knowledge and isolated tasks. However, they fail to capture:

  • Consistency in multi-step decision-making
  • Judgment under uncertainty
  • Strategic thinking that accounts for long-term outcomes
  • Trade-off evaluation and selection

The Benchmark Optimization Problem

In AI development, improving standard benchmark scores has become a primary goal, which encourages a phenomenon known as “benchmark hacking”:

  1. Overfitting risk: Specializing in patterns similar to benchmark tests
  2. Reduced generalization: Sacrificing ability to handle unexpected tasks
  3. Gap between apparent and real-world performance: Great numbers, poor practical utility
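The gap between apparent and real-world performance can be illustrated with a deliberately extreme toy example: a “model” that simply memorizes the public benchmark's answer key scores perfectly on the benchmark but only at chance on held-out tasks. This is a caricature, not how real LLM training works, but it shows why a high score alone proves little about generalization.

```python
import random

# Toy illustration of benchmark overfitting (not a real model):
# "training" here is just memorizing the public answer key.
rng = random.Random(42)
benchmark = {f"q{i}": rng.choice("ABCD") for i in range(100)}  # public set
held_out = {f"h{i}": rng.choice("ABCD") for i in range(100)}   # novel tasks

memorized = dict(benchmark)  # the "benchmark-hacked" model

def accuracy(model_answers, tasks):
    # Unseen questions get a random guess.
    correct = sum(model_answers.get(q, rng.choice("ABCD")) == a
                  for q, a in tasks.items())
    return correct / len(tasks)

print(accuracy(memorized, benchmark))  # 1.0: a perfect benchmark score
print(accuracy(memorized, held_out))   # roughly 0.25: chance on unseen tasks
```

The same headline number (100% on the benchmark) describes a system with no transferable capability at all, which is the essence of the overfitting risk listed above.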

Community Reaction

The Reddit r/LocalLLaMA discussion featured notable perspectives:

  • “Active parameters ≠ intelligence”: Model size alone doesn’t determine capability
  • Architecture matters: MoE (Mixture of Experts) routing efficiency significantly impacts results
  • Training data quality: Not just quantity, but quality and diversity matter

GLM-5’s dominant performance, finishing with a balance of over $8,000, is also noteworthy. Models that rank below Qwen 3.5 on standard benchmarks can dramatically outperform it on practical tasks.

The Future of AI Evaluation

The Need for Multi-Dimensional Assessment

```mermaid
graph LR
    A[Future of AI Evaluation] --> B[Standard Benchmarks<br/>Knowledge & Reasoning]
    A --> C[Practical Benchmarks<br/>Vending-Bench etc.]
    A --> D[Domain-Specific Eval<br/>Medical, Legal, Financial]
    A --> E[Human Evaluation<br/>Chatbot Arena etc.]

    B --> F[Comprehensive<br/>Model Assessment]
    C --> F
    D --> F
    E --> F
```

These results clearly demonstrate that no single benchmark should determine a model’s overall capability:

  1. Multi-dimensional evaluation: Assessing knowledge, reasoning, practical application, and creativity
  2. Real-world simulations: Expanding practical benchmarks like Vending-Bench
  3. Domain-specific evaluation: Specialized testing aligned with intended use cases
  4. Continuous monitoring: Evaluation across varied conditions, not just one-time tests
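One simple way to operationalize multi-dimensional evaluation is a weighted score profile rather than a single leaderboard number. The sketch below is purely illustrative: the axis names, scores, and weights are made up to show how a benchmark-only weighting can hide exactly the weakness Vending-Bench 2 exposes.

```python
# Hypothetical score profile across evaluation axes.
# All numbers and weights are illustrative, not real results.
def composite_score(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted average of per-axis scores."""
    total_w = sum(weights.values())
    return sum(scores[axis] * w for axis, w in weights.items()) / total_w

model = {
    "knowledge": 0.90,  # e.g. an MMLU-style score
    "coding":    0.85,  # e.g. a HumanEval-style score
    "practical": 0.10,  # e.g. a Vending-Bench-style survival metric
}

# A benchmark-only weighting hides the practical weakness...
print(composite_score(model, {"knowledge": 1, "coding": 1, "practical": 0}))
# ...while a balanced weighting surfaces it.
print(composite_score(model, {"knowledge": 1, "coding": 1, "practical": 1}))
```

With the practical axis weighted at zero the model looks excellent (0.875); including it drags the composite down sharply, which is the point of evaluating across dimensions.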

Conclusion

Qwen 3.5 Plus’s bankruptcy on Vending-Bench 2 is a stark reminder of the dangers of benchmark-obsessed AI evaluation. The fact that a top-ranking model on standard benchmarks can finish last in a practical scenario underscores the need to look beyond numbers when choosing AI models.

Measuring AI’s true capabilities requires not just standardized tests but diverse benchmarks that reflect real-world complexity.


About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.