mcp2cli — Cut MCP Token Costs by 96–99% with CLI-Based Tool Discovery
Connecting MCP servers injects all tool schemas into context every turn—362,000 tokens wasted for 120 tools over 25 turns. mcp2cli solves this with CLI-based on-demand discovery, cutting costs by 96–99%. Here's how it works and when to use it.
Overview
As Model Context Protocol (MCP) becomes the standard for connecting AI agents to external tools and APIs, a new bottleneck has emerged: tool schema token waste.
When you connect MCP servers, every tool’s JSON schema is injected into the LLM’s context window on every single conversation turn—whether or not the model uses those tools. With 30 tools, that’s approximately 3,600 tokens burned per turn doing nothing. Scale to 120 tools over a 25-turn conversation and you’re looking at 362,000 tokens consumed by schemas alone.
mcp2cli solves this with CLI-based on-demand tool discovery. Instead of preloading all schemas upfront, the model queries --list and --help only when needed—cutting token waste by 96–99%.
The Problem: Cost of Upfront Schema Injection
How Traditional MCP Integration Works
[Conversation Start]
System Prompt + ALL tool schemas (30 tools × 121 tokens = 3,630 tokens)
│
Turn 1: User message + ALL schemas re-injected
Turn 2: User message + ALL schemas re-injected
Turn 3: User message + ALL schemas re-injected
...
Turn 15: User message + ALL schemas re-injected
30 tools × 121 tokens × 15 turns ≈ 54,450 tokens consumed by schemas alone, regardless of whether the model called any tool on that turn.
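The arithmetic above can be sketched in a few lines. The 121-tokens-per-schema figure is the article's own estimate; real schemas vary in size.

```python
# Per-turn schema cost under traditional MCP integration: every schema is
# re-injected on every turn, used or not (~121 tokens per schema, per the
# article's estimate).
TOKENS_PER_SCHEMA = 121

def schema_cost(num_tools: int, num_turns: int) -> int:
    """Tokens consumed by tool schemas alone over a conversation."""
    return num_tools * TOKENS_PER_SCHEMA * num_turns

print(schema_cost(30, 15))  # 54450 tokens for the scenario above
```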
Measured Token Costs
| Scenario | Native MCP | mcp2cli | Savings |
|---|---|---|---|
| 30 tools, 15 turns | 54,525 tokens | 2,309 tokens | 96% |
| 80 tools, 20 turns | 193,240 tokens | 3,871 tokens | 98% |
| 120 tools, 25 turns | 362,350 tokens | 5,181 tokens | 99% |
| 200-endpoint API, 25 turns | 358,425 tokens | 3,925 tokens | 99% |
The more tools you have and the longer the conversation, the greater the savings. At enterprise scale, this changes the cost structure entirely.
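The savings column follows directly from the measured figures; recomputing it from the table's own numbers:

```python
# Recomputing the savings percentages from the measured token counts in
# the table above (native vs. mcp2cli).
scenarios = {
    "30 tools, 15 turns":  (54_525, 2_309),
    "80 tools, 20 turns":  (193_240, 3_871),
    "120 tools, 25 turns": (362_350, 5_181),
}
for name, (native, cli) in scenarios.items():
    savings = 1 - cli / native
    print(f"{name}: {savings:.0%} saved")
```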
How mcp2cli Works
Core Idea: Schema Preload → On-Demand Discovery
[Traditional approach]
All tool schemas → always in context
[mcp2cli approach]
Tool list only (~16 tokens/tool) → model calls --help only when needed (~120 tokens/tool)
The LLM receives only tool names and brief descriptions via mcp2cli --list, then calls mcp2cli [tool-name] --help only when it actually wants to use that tool.
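A simplified cost model makes the difference concrete. It uses the article's per-tool figures (~16 tokens for a list entry, ~121 for a full schema) and assumes the tool list is re-sent each turn; mcp2cli's actual accounting may differ, so treat this as a sketch, not its internals.

```python
# Simplified context-footprint comparison: preload everything vs. list
# plus on-demand --help. Figures are the article's per-tool estimates.
LIST_TOKENS, SCHEMA_TOKENS = 16, 121

def preload_footprint(num_tools: int, turns: int) -> int:
    # Every schema, every turn.
    return num_tools * SCHEMA_TOKENS * turns

def on_demand_footprint(num_tools: int, turns: int, tools_used: int) -> int:
    # Tool list each turn, plus one --help lookup per tool actually used.
    return num_tools * LIST_TOKENS * turns + tools_used * SCHEMA_TOKENS

print(preload_footprint(30, 15))       # 54450
print(on_demand_footprint(30, 15, 3))  # 7563
```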
Four-Stage Processing Pipeline
1. Spec Loading
Read MCP server URL or OpenAPI spec file
2. Tool Definition Extraction
Parse tool names, parameters, and descriptions from schema
3. Argument Parser Generation
Dynamically create CLI commands for each tool (no codegen, runtime only)
4. Execution
Forward as HTTP or tool-call request to the MCP server
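Stages 2 and 3 can be illustrated with a runtime argument-parser build. This is a hypothetical sketch of the pattern, not mcp2cli's actual code; the tool names and parameters below are invented for the example.

```python
# Illustrative sketch of stages 2-3: tool definitions extracted from a
# spec become CLI subcommands at runtime, with no generated code on disk.
import argparse

tool_defs = [  # stage 2 output: parsed from the MCP/OpenAPI schema
    {"name": "search-files", "params": ["query", "path"]},
    {"name": "list-items",   "params": ["limit"]},
]

parser = argparse.ArgumentParser(prog="mcp2cli")
sub = parser.add_subparsers(dest="tool")
for tool in tool_defs:  # stage 3: parsers built dynamically, runtime only
    p = sub.add_parser(tool["name"])
    for param in tool["params"]:
        p.add_argument(f"--{param}")

args = parser.parse_args(["search-files", "--query", "readme"])
print(args.tool, args.query)  # search-files readme
```

Because the parsers are rebuilt from the spec on each run, a tool added to the server simply shows up as a new subcommand, which is the "zero codegen" property described below.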
Installation and Basic Usage
# Install
pip install mcp2cli
# List available tools (~16 tokens/tool)
mcp2cli --mcp https://server.url/sse --list
# Get specific tool details (~120 tokens, only when needed)
mcp2cli --mcp https://server.url/sse search-files --help
# Use with OpenAPI spec
mcp2cli --spec api.json --base-url https://api.com list-items
# TOON format (Token-Optimized Output Notation)
mcp2cli --mcp https://server.url/sse search-files --toon
Zero Codegen: Why It Matters
mcp2cli reads specs at runtime and generates the CLI dynamically. No code generation means:
- New tools added to the MCP server appear automatically on the next invocation
- No spec files to commit or maintain
- Intelligent 1-hour TTL caching prevents unnecessary reloads
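The TTL cache behavior can be sketched as follows. This is an illustrative implementation of the idea, not mcp2cli's source; the class and method names are invented.

```python
# Minimal sketch of a TTL spec cache: reuse a fetched spec until it is
# older than the TTL (mcp2cli defaults to 1 hour), then reload.
import time

class SpecCache:
    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store = {}  # url -> (timestamp, spec)

    def get(self, url: str, fetch):
        entry = self._store.get(url)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # still fresh: reuse cached spec
        spec = fetch(url)    # stale or missing: reload from the server
        self._store[url] = (time.monotonic(), spec)
        return spec
```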
Engineering Manager’s Perspective: Adoption Strategy
Calculating the Business Impact
Assume your team operates an AI agent with 100 MCP tools integrated.
Native MCP (1,000 conversations/day, 20 turns avg):
100 tools × 121 tokens × 20 turns × 1,000 conversations = 242,000,000 tokens/day
After mcp2cli (98% savings):
~4,840,000 tokens/day
Difference: 237,160,000 tokens/day
At Claude Sonnet 4.6 pricing ($3/MTok): ~$711/day saved, ~$21,000/month
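The back-of-envelope numbers above check out:

```python
# Verifying the daily-cost arithmetic above.
tokens_per_day = 100 * 121 * 20 * 1_000      # 242,000,000 tokens/day
after = tokens_per_day * (1 - 0.98)          # ~4,840,000 with 98% savings
saved = tokens_per_day - after               # 237,160,000 tokens/day
usd_per_day = saved / 1_000_000 * 3.0        # at $3 per million tokens
print(f"${usd_per_day:,.0f}/day, ${usd_per_day * 30:,.0f}/month")
```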
Beyond cost, keeping the context window clean directly affects model reasoning quality and latency.
Understanding the Trade-offs
mcp2cli isn’t a silver bullet. The Hacker News discussion (133 upvotes, 92 comments) surfaced key concerns:
Additional round-trips: The model needs a separate --help call the first time it uses a tool. For short tasks, this can actually increase latency.
Discovery error potential: The model might try incorrect tool names or misinterpret --help output.
Optimal use cases: 20 or more tools, conversations of 10+ turns, with most tools unused on most turns.
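A rough break-even check captures this trade-off. It uses the article's per-tool token estimates under a simplified model (tool list re-sent each turn, one --help per tool used) and deliberately ignores the latency cost of the extra round-trips:

```python
# Rough break-even check: on-demand discovery pays off once most tools go
# unused on most turns; it can lose on short tasks that touch every tool.
LIST_TOKENS, SCHEMA_TOKENS = 16, 121

def on_demand_wins(num_tools: int, turns: int, tools_used: int) -> bool:
    preload = num_tools * SCHEMA_TOKENS * turns
    on_demand = num_tools * LIST_TOKENS * turns + tools_used * SCHEMA_TOKENS
    return on_demand < preload

print(on_demand_wins(30, 15, 3))  # True: long conversation, few tools used
print(on_demand_wins(5, 1, 5))    # False: short task using every tool
```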
Adoption Roadmap
Step 1: Measure
Track actual tool schema token consumption in current AI agents
(Check system prompt token count in conversation logs)
Step 2: Pilot
Apply mcp2cli to the one agent with the most MCP tool integrations
A/B test: compare cost, accuracy, and latency
Step 3: Analyze
Identify which tools are actually used frequently
Consider hybrid: preload frequent tools, on-demand for the rest
Step 4: Scale
Roll out to all agents after validating effectiveness
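The hybrid idea from Step 3 can be sketched as a simple partition of tools by observed call frequency. The log format and threshold here are hypothetical; adapt them to whatever usage analytics your agents emit.

```python
# Sketch of the Step 3 hybrid: preload frequently used tools, leave the
# long tail to on-demand discovery. Tool names and threshold are invented.
from collections import Counter

call_log = ["search-files", "search-files", "list-items",
            "search-files", "delete-item"]  # from agent conversation logs

usage = Counter(call_log)
PRELOAD_THRESHOLD = 2  # calls per analysis window

preload = {tool for tool, n in usage.items() if n >= PRELOAD_THRESHOLD}
on_demand = set(usage) - preload
print(sorted(preload), sorted(on_demand))
```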
What the Hacker News Community Said
Reactions were mixed, which is worth understanding:
Positive responses:
- “Applying the lazy loading pattern to LLM tool discovery is elegant”
- “This could be a game changer for large-scale MCP environments”
Critical responses:
- “Token savings don’t automatically guarantee better outputs”
- “Extra round-trips for tool discovery increase latency and introduce potential for errors”
- “Benchmarks skew toward ideal scenarios”
In practice, validate against your actual workloads rather than trusting benchmarks at face value.
Production Considerations
MCP Server Type Compatibility
✅ HTTP/SSE MCP servers: Full support
✅ stdio MCP servers: Supported
✅ OpenAPI JSON/YAML: Supported
⚠️ Auth-required servers: Built-in OAuth support, requires configuration
Caching Strategy
# Default caching: 1-hour TTL
mcp2cli --mcp server.url --cache-ttl 3600 --list
# Force refresh
mcp2cli --mcp server.url --no-cache --list
Use --no-cache in development where specs change frequently; increase TTL in stable production environments.
Takeaway
The problem mcp2cli solves is simple but real. As the MCP ecosystem matures and the number of integrated servers and tools grows, schema injection costs compound: they grow with every tool and with every turn, and the two multiply together.
- 30 tools: May not justify the change
- 80+ tools: Monthly costs start to look noticeably different
- 120+ tools: This becomes a survival strategy, not just optimization
Beyond token savings, keeping the context window clean has a positive effect on actual model reasoning quality. Reducing noise in the context window is becoming as important as prompt engineering itself.