Cut Agent Fleet Costs by 90% with Heterogeneous LLM Architecture
The Plan-Execute pattern: large models plan, small models execute. A practical guide for EMs and CTOs on heterogeneous LLM architecture strategies to dramatically reduce agent fleet costs without sacrificing quality.
Every engineering team running an agent fleet faces the same uncomfortable reality: LLM API bills that grow faster than expected. Using frontier models like Claude Opus 4.6 or GPT-5.3 for every single task guarantees quality—but the costs quickly become unsustainable at production scale.
Here’s the insight that changes everything: not all tasks in your agent system actually require frontier-level reasoning. In this post, I’ll walk through the Heterogeneous LLM Architecture strategy that EMs and CTOs need to understand to cut costs by up to 90% while maintaining quality.
Why Single-Model Strategies Fail at Scale
Most teams start by picking the best model they can find and routing everything through it. This makes sense during prototyping—you eliminate quality concerns and can focus on validating the product.
The problem emerges at production scale.
Consider a system processing 10,000 agent tasks per day. Using Claude Opus 4.6 for everything costs roughly $800–$1,300 per day, or $300,000–$500,000 annually. That’s a significant budget line for any startup or mid-sized company.
But here’s the question you need to ask: how many of those 10,000 tasks actually need frontier-level reasoning? Analyzing production systems consistently reveals the same pattern: complex strategic reasoning is required for only 10–20% of tasks. The remaining 80–90% involve data formatting, text classification, simple summarization, and routing decisions—work that much smaller models handle perfectly well.
The Three-Tier Model Architecture
The core principle of heterogeneous LLM architecture is matching model capability to task complexity. Think of it like staffing: you don’t assign a senior principal engineer to fix a CSS bug, and you don’t route every LLM call through your most expensive model.
The standard implementation uses three tiers:
Tier 1: Frontier Models
- Examples: Claude Opus 4.6, GPT-5.3, Gemini 3.1 Pro
- Use cases: Complex strategy formulation, multi-step reasoning, code architecture design, interpreting ambiguous requirements
- Cost: High (used for only 10–20% of total calls)
Tier 2: Mid-Tier Models
- Examples: Claude Sonnet 4.6, GPT-4o, Gemini 1.5 Flash
- Use cases: Document summarization, code review, moderate data analysis, language translation
- Cost: Medium (used for 30–40% of total calls)
Tier 3: Small Language Models
- Examples: Claude Haiku 4.5, GPT-4o-mini, Phi-3, Qwen3-Coder
- Use cases: Routing decisions, text classification, format conversion, keyword extraction, simple Q&A
- Cost: Low (used for 40–60% of total calls)
The Plan-Execute Pattern: Maximum Cost Impact
The most impactful pattern in heterogeneous architecture is Plan-Execute. The concept is elegantly simple:
Planning Phase A frontier model analyzes the overall task and generates a detailed execution plan—breaking it into subtasks, specifying what tools and data each subtask needs, and anticipating edge cases. The output is a precise specification.
Execution Phase Small models execute each subtask following the specification. Because the “what to do” is already clearly defined, small models can perform with high accuracy without needing the frontier model’s full reasoning capability.
Research shows this pattern can reduce costs by up to 90%—using the expensive frontier model for just 5–10% of total work, while small models handle the rest.
# Plan-Execute Pattern Implementation
import anthropic
class HeterogeneousAgentFleet:
def __init__(self):
self.client = anthropic.Anthropic()
self.planner_model = "claude-opus-4-6" # Frontier: complex planning
self.executor_model = "claude-haiku-4-5-20251001" # Small: execution
self.reviewer_model = "claude-sonnet-4-6" # Mid-tier: validation
def plan_task(self, task: str) -> dict:
"""Frontier model generates execution plan"""
response = self.client.messages.create(
model=self.planner_model,
max_tokens=1024,
messages=[{
"role": "user",
"content": f"""Decompose this task into small subtasks and return
execution specs as JSON.
Task: {task}
Format: {{"subtasks": [{{"id": 1, "instruction": "...", "expected_output": "..."}}]}}
"""
}]
)
return self._parse_plan(response.content[0].text)
def execute_subtask(self, subtask: dict) -> str:
"""Small model executes individual subtask"""
response = self.client.messages.create(
model=self.executor_model,
max_tokens=512,
messages=[{
"role": "user",
"content": f"""Instruction: {subtask['instruction']}
Expected output format: {subtask['expected_output']}"""
}]
)
return response.content[0].text
def run(self, task: str) -> list:
"""Execute full Plan-Execute pipeline"""
plan = self.plan_task(task) # Cost: ~$0.015 (Opus, 1 call)
results = []
for subtask in plan["subtasks"]:
result = self.execute_subtask(subtask) # Cost: ~$0.0003 (Haiku, 1 call)
results.append(result)
return results
The Routing Layer: Automating Model Selection
A key component that makes heterogeneous architecture practical is the routing layer—the automated system that classifies every incoming request and directs it to the appropriate model tier. Without this, engineers end up manually deciding which model to use, eliminating the efficiency gains.
An effective routing layer evaluates:
- Complexity: Does the request require multi-step reasoning?
- Domain specificity: Does it need specialized knowledge or nuanced judgment?
- Context length: Does it require processing long contexts?
- Risk level: Is accuracy critical and failures unacceptable?
The elegant part: routing decisions themselves can be handled by small models, since classification is exactly the kind of task small models excel at.
Cost Analysis: Before and After
Applying heterogeneous architecture to a production system processing 10,000 tasks/day:
Before (Single-Model Strategy)
- All 10,000 calls through Claude Opus 4.6
- Average: 1,000 input tokens, 500 output tokens per call
- Daily cost: ~$900
After (Heterogeneous Architecture)
- Frontier (Opus) 10%: 1,000 calls × $0.09 = $90
- Mid-tier (Sonnet) 35%: 3,500 calls × $0.009 = $31.50
- Small (Haiku) 55%: 5,500 calls × $0.00063 = $3.50
- Total daily cost: ~$125
Result: 86% cost reduction, $23,000 saved per month
Real-world results vary based on task distribution and token usage, but 50–90% cost reduction is a realistic target for most agent systems.
The 3-Phase Migration Strategy for EMs and CTOs
Phase 1: Audit Current Usage (1 week) Analyze API logs to classify request types. Determine the complexity distribution (simple vs. complex ratio) and identify cost bottlenecks.
Phase 2: Pilot Routing (2–3 weeks) Migrate the simplest 20% of tasks to small models. Monitor quality metrics (accuracy, user feedback). Expand scope incrementally based on results.
Phase 3: Full Heterogeneous Architecture (1–2 months) Implement automated routing layer. Apply Plan-Execute pattern for complex workflows. Build a cost dashboard for continuous optimization.
Maintaining Quality While Cutting Costs
The key insight: small models fail because of unclear instructions, not model limitations. When the planning phase generates precise specifications, small models can produce high-quality output consistently.
Add a fallback mechanism: if a small model produces output that fails quality checks, automatically retry with the next tier up. Even with 5–10% fallback rate, the overall cost savings remain substantial.
Monitor cost-quality trade-offs continuously. After implementing heterogeneous architecture, track quality metrics against the baseline to verify that cost reductions don’t come at an unacceptable quality cost.
Closing Thoughts
The economics of running an agent fleet determine how aggressively your team can invest in AI capabilities. When costs are manageable, you can scale agent usage, iterate faster, and deploy AI to more parts of your product.
Using only frontier models for every task is like hiring a senior architect to write every line of code. Just as you staff appropriately for task complexity with human teams, your AI agent system should route work to the right model for each job.
The engineering leader of 2026 isn’t asking “which model should we use?”—they’re designing systems that answer “which model is optimal for each specific task, at scale?”
Was this helpful?
Your support helps me create better content. Buy me a coffee! ☕