Qwen 3.5 Goes Bankrupt on Vending-Bench 2: The Pitfall of Benchmark Obsession
Qwen 3.5, a top performer on standard benchmarks, goes bankrupt on Vending-Bench 2's vending machine simulation. Exploring the blind spots of benchmark-driven AI evaluation.
Overview
Alibaba’s large language model Qwen 3.5 Plus consistently ranks at the top of standard benchmarks like MMLU, HumanEval, and MATH. However, on Vending-Bench 2, a non-standard benchmark developed by Andon Labs, the model delivered a shocking result: bankruptcy. This finding garnered over 595 upvotes on Reddit’s r/LocalLLaMA, sparking a broader discussion about how we evaluate AI models.
What Is Vending-Bench 2?
Vending-Bench 2 is a vending machine business simulation benchmark developed by Andon Labs. It tasks AI models with running a virtual vending machine business over approximately 365 days, comprehensively measuring financial management, decision-making, and strategic planning capabilities.
Unlike traditional benchmarks, it measures practical abilities such as the following (a toy simulation sketch appears after the list):
- Long-term strategic thinking: Continuous business decisions over a full year
- Financial risk management: Balancing profitability with sustainability
- Adaptability: Responding to changing simulation conditions
- Applied reasoning: Not just knowledge, but the ability to apply it
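To make the setup concrete, here is a minimal sketch of what a simulation benchmark in this style could look like. It is not Andon Labs' actual harness: the `VendingSim` environment, its demand curve, the cost figures, and the `naive_agent` policy are all hypothetical, chosen only to show how daily decisions compound into a final balance (or a bankruptcy).

```python
import random


class VendingSim:
    """Toy vending-machine simulation (illustrative only, not Vending-Bench 2)."""

    def __init__(self, starting_cash=500.0, unit_cost=1.0, seed=42):
        self.cash = starting_cash
        self.stock = 0
        self.unit_cost = unit_cost
        self.rng = random.Random(seed)
        self.bankrupt = False

    def step(self, restock_qty: int, price: float) -> float:
        """Apply one day's decisions and return the new cash balance."""
        if self.bankrupt:
            return self.cash
        # Pay for inventory up front.
        self.cash -= restock_qty * self.unit_cost
        self.stock += restock_qty
        # Demand falls as price rises (hypothetical demand curve).
        demand = max(0, int(self.rng.gauss(40 - 10 * price, 5)))
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * price
        # Fixed daily operating fee: the slow bleed that punishes inaction.
        self.cash -= 2.0
        if self.cash < 0:
            self.bankrupt = True
        return self.cash


def naive_agent(day: int, cash: float, stock: int) -> tuple[int, float]:
    """Hypothetical baseline policy: restock a fixed amount, charge a fixed price."""
    return (30 if stock < 20 else 0), 2.0


if __name__ == "__main__":
    sim = VendingSim()
    for day in range(365):
        qty, price = naive_agent(day, sim.cash, sim.stock)
        sim.step(qty, price)
        if sim.bankrupt:
            print(f"Bankrupt on day {day}")
            break
    else:
        print(f"Final balance after 365 days: ${sim.cash:.2f}")
```

In the real benchmark, an LLM agent takes the place of the decision policy, and its ability to react to evolving state (cash, stock, shifting demand) over hundreds of steps is exactly what question-answer benchmarks never exercise.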
Shocking Results: Qwen 3.5 Finishes Last with Bankruptcy

Each model’s performance over the 365-day simulation is summarized in the table below:
| Rank | Model | Final Balance (Approx.) |
|---|---|---|
| 1st | GLM-5 | ~$8,000+ |
| 2nd | Gemini 3 Flash | ~$4,000–$4,500 |
| 3rd | Kimi K2.5 | ~$3,500–$4,000 |
| 4th | Claude Opus 4.6 | ~$2,000–$2,500 |
| 5th | DeepSeek-V3.2 | ~$200–$500 |
| 6th | Qwen 3.5 Plus | ~$0 (Bankrupt) |
A model that ranks among the best on standard benchmarks finished dead last with zero balance — a truly stunning result.
Why Does This Discrepancy Exist?
The Limits of Standard Benchmarks
```mermaid
graph TD
A[Standard Benchmarks] --> B[Knowledge Tests<br/>MMLU, ARC]
A --> C[Coding<br/>HumanEval, MBPP]
A --> D[Math<br/>MATH, GSM8K]
A --> E[Reasoning<br/>BBH, HellaSwag]
F[Vending-Bench 2] --> G[Long-term Strategy]
F --> H[Financial Management]
F --> I[Risk Assessment]
F --> J[Adaptability]
style A fill:#e8f5e9
style F fill:#fff3e0
```
Standard benchmarks excel at measuring static knowledge and isolated tasks. However, they fail to capture:
- Consistency in multi-step decision-making (a quick calculation after this list shows why this is so unforgiving)
- Judgment under uncertainty
- Strategic thinking that accounts for long-term outcomes
- Trade-off evaluation and selection
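A back-of-the-envelope calculation illustrates the first point. The 95% per-step figure below is an assumed number, not a measured property of any model, and the independence assumption is generous: in a real simulation, one bad decision (an empty bank account, a depleted inventory) poisons every step after it.

```python
# Illustration: why per-question accuracy says little about long-horizon tasks.
# The 95% per-step success rate is an assumed figure, not a measured one.
per_step_accuracy = 0.95

for horizon in (1, 10, 100, 365):
    # Probability of making no serious mistake across the whole horizon,
    # assuming independent steps (a simplification; real errors compound
    # because a bad decision changes the state every later step sees).
    p_clean_run = per_step_accuracy ** horizon
    print(f"{horizon:>4} steps: {p_clean_run:.1%} chance of an error-free run")
```

At 365 steps the chance of a clean run is effectively zero, which is why a model can look excellent on isolated questions yet still run a year-long business into the ground.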
The Benchmark Optimization Problem
In AI development, standard benchmark scores have become a key success metric. This incentive fuels a phenomenon known as “benchmark hacking”:
- Overfitting risk: Specializing in patterns similar to benchmark tests
- Reduced generalization: Sacrificing ability to handle unexpected tasks
- Gap between apparent and real-world performance: Great numbers, poor practical utility
Community Reaction
The Reddit r/LocalLLaMA discussion featured notable perspectives:
- “Active parameters ≠ intelligence”: Model size alone doesn’t determine capability
- Architecture matters: MoE (Mixture of Experts) routing efficiency significantly impacts results
- Training data quality: Not just quantity, but quality and diversity matter
GLM-5’s dominant performance, finishing with a balance of more than $8,000, is also noteworthy. Models that rank below Qwen 3.5 on standard benchmarks can dramatically outperform it on practical tasks.
The Future of AI Evaluation
The Need for Multi-Dimensional Assessment
```mermaid
graph LR
A[Future of AI Evaluation] --> B[Standard Benchmarks<br/>Knowledge & Reasoning]
A --> C[Practical Benchmarks<br/>Vending-Bench etc.]
A --> D[Domain-Specific Eval<br/>Medical, Legal, Financial]
A --> E[Human Evaluation<br/>Chatbot Arena etc.]
B --> F[Comprehensive<br/>Model Assessment]
C --> F
D --> F
E --> F
```
These results clearly demonstrate that no single benchmark should determine a model’s overall capability (a small weighting sketch follows the list below):
- Multi-dimensional evaluation: Assessing knowledge, reasoning, practical application, and creativity
- Real-world simulations: Expanding practical benchmarks like Vending-Bench
- Domain-specific evaluation: Specialized testing aligned with intended use cases
- Continuous monitoring: Evaluation across varied conditions, not just one-time tests
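As a minimal sketch of what multi-dimensional assessment could look like in practice, the snippet below combines scores from several dimensions with use-case-specific weights. The benchmark names are real, but every score and weight is a made-up placeholder; the point is only that the same model can rank very differently depending on what the deployment actually requires.

```python
# Hypothetical composite scoring across evaluation dimensions.
# Benchmark names are real, but the scores and weights are placeholders
# meant to show the mechanics, not actual results for any model.
scores = {
    "knowledge (MMLU)": 0.88,
    "coding (HumanEval)": 0.82,
    "long-horizon agentic (Vending-Bench 2)": 0.10,
    "human preference (Chatbot Arena, normalized)": 0.75,
}

# Weights should reflect the intended deployment, e.g. an autonomous
# business agent cares far more about the long-horizon dimension.
weights_chat_assistant = {
    "knowledge (MMLU)": 0.35,
    "coding (HumanEval)": 0.15,
    "long-horizon agentic (Vending-Bench 2)": 0.15,
    "human preference (Chatbot Arena, normalized)": 0.35,
}
weights_autonomous_agent = {
    "knowledge (MMLU)": 0.15,
    "coding (HumanEval)": 0.15,
    "long-horizon agentic (Vending-Bench 2)": 0.55,
    "human preference (Chatbot Arena, normalized)": 0.15,
}


def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores."""
    return sum(scores[k] * weights[k] for k in scores)


print(f"As a chat assistant:    {composite(scores, weights_chat_assistant):.2f}")
print(f"As an autonomous agent: {composite(scores, weights_autonomous_agent):.2f}")
```

Under these hypothetical numbers, the same score sheet yields roughly 0.71 for a chat-assistant profile and 0.42 for an autonomous-agent profile, mirroring the gap between Qwen 3.5’s benchmark rankings and its Vending-Bench result.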
Conclusion
Qwen 3.5 Plus’s bankruptcy on Vending-Bench 2 is a stark reminder of the dangers of benchmark-obsessed AI evaluation. The fact that a top-ranking model on standard benchmarks can finish last in a practical scenario underscores the need to look beyond numbers when choosing AI models.
Measuring AI’s true capabilities requires not just standardized tests but diverse benchmarks that reflect real-world complexity.