mcp2cli — Cut MCP Token Costs by 96–99% with CLI-Based Tool Discovery
Connecting MCP servers injects all tool schemas into context every turn—362,000 tokens wasted for 120 tools over 25 turns. mcp2cli solves this with CLI-based on-demand discovery, cutting costs by 96–99%. Here's how it works and when to use it.
Overview
As Model Context Protocol (MCP) becomes the standard for connecting AI agents to external tools and APIs, a new bottleneck has emerged: tool schema token waste.
When you connect MCP servers, every tool’s JSON schema is injected into the LLM’s context window on every single conversation turn—whether or not the model uses those tools. With 30 tools, that’s approximately 3,600 tokens burned per turn doing nothing. Scale to 120 tools over a 25-turn conversation and you’re looking at 362,000 tokens consumed by schemas alone.
mcp2cli solves this with CLI-based on-demand tool discovery. Instead of preloading all schemas upfront, the model queries --list and --help only when needed—cutting token waste by 96–99%.
The Problem: Cost of Upfront Schema Injection
How Traditional MCP Integration Works
[Conversation Start]
System Prompt + ALL tool schemas (30 tools × 121 tokens = 3,630 tokens)
│
Turn 1: User message + ALL schemas re-injected
Turn 2: User message + ALL schemas re-injected
Turn 3: User message + ALL schemas re-injected
...
Turn 15: User message + ALL schemas re-injected
30 tools × 121 tokens × 15 turns ≈ 54,450 tokens consumed by schemas alone, regardless of whether the model called any tool on that turn.
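The arithmetic above can be sketched in a few lines. The 121-tokens-per-schema figure is the article's own estimate; real schemas vary in size.

```python
# Per-turn schema cost under traditional MCP integration: every schema is
# re-injected on every turn, used or not (~121 tokens per schema, per the
# article's estimate).
TOKENS_PER_SCHEMA = 121

def schema_cost(num_tools: int, num_turns: int) -> int:
    """Tokens consumed by tool schemas alone over a conversation."""
    return num_tools * TOKENS_PER_SCHEMA * num_turns

print(schema_cost(30, 15))  # 54450 tokens for the scenario above
```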
Measured Token Costs
| Scenario | Native MCP | mcp2cli | Savings |
|---|---|---|---|
| 30 tools, 15 turns | 54,525 tokens | 2,309 tokens | 96% |
| 80 tools, 20 turns | 193,240 tokens | 3,871 tokens | 98% |
| 120 tools, 25 turns | 362,350 tokens | 5,181 tokens | 99% |
| 200-endpoint API, 25 turns | 358,425 tokens | 3,925 tokens | 99% |
The more tools you have and the longer the conversation, the greater the savings. At enterprise scale, this changes the cost structure entirely.
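The savings column follows directly from the measured figures; recomputing it from the table's own numbers:

```python
# Recomputing the savings percentages from the measured token counts in
# the table above (native vs. mcp2cli).
scenarios = {
    "30 tools, 15 turns":  (54_525, 2_309),
    "80 tools, 20 turns":  (193_240, 3_871),
    "120 tools, 25 turns": (362_350, 5_181),
}
for name, (native, cli) in scenarios.items():
    savings = 1 - cli / native
    print(f"{name}: {savings:.0%} saved")
```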
How mcp2cli Works
Core Idea: Schema Preload → On-Demand Discovery
[Traditional approach]
All tool schemas → always in context
[mcp2cli approach]
Tool list only (~16 tokens/tool) → model calls --help only when needed (~120 tokens/tool)
The LLM receives only tool names and brief descriptions via mcp2cli --list, then calls mcp2cli [tool-name] --help only when it actually wants to use that tool.
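A simplified cost model makes the difference concrete. It uses the article's per-tool figures (~16 tokens for a list entry, ~121 for a full schema) and assumes the tool list is re-sent each turn; mcp2cli's actual accounting may differ, so treat this as a sketch, not its internals.

```python
# Simplified context-footprint comparison: preload everything vs. list
# plus on-demand --help. Figures are the article's per-tool estimates.
LIST_TOKENS, SCHEMA_TOKENS = 16, 121

def preload_footprint(num_tools: int, turns: int) -> int:
    # Every schema, every turn.
    return num_tools * SCHEMA_TOKENS * turns

def on_demand_footprint(num_tools: int, turns: int, tools_used: int) -> int:
    # Tool list each turn, plus one --help lookup per tool actually used.
    return num_tools * LIST_TOKENS * turns + tools_used * SCHEMA_TOKENS

print(preload_footprint(30, 15))       # 54450
print(on_demand_footprint(30, 15, 3))  # 7563
```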
Four-Stage Processing Pipeline
1. Spec Loading
Read MCP server URL or OpenAPI spec file
2. Tool Definition Extraction
Parse tool names, parameters, and descriptions from schema
3. Argument Parser Generation
Dynamically create CLI commands for each tool (no codegen, runtime only)
4. Execution
Forward as HTTP or tool-call request to the MCP server
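Stages 2 and 3 can be illustrated with a runtime argument-parser build. This is a hypothetical sketch of the pattern, not mcp2cli's actual code; the tool names and parameters below are invented for the example.

```python
# Illustrative sketch of stages 2-3: tool definitions extracted from a
# spec become CLI subcommands at runtime, with no generated code on disk.
import argparse

tool_defs = [  # stage 2 output: parsed from the MCP/OpenAPI schema
    {"name": "search-files", "params": ["query", "path"]},
    {"name": "list-items",   "params": ["limit"]},
]

parser = argparse.ArgumentParser(prog="mcp2cli")
sub = parser.add_subparsers(dest="tool")
for tool in tool_defs:  # stage 3: parsers built dynamically, runtime only
    p = sub.add_parser(tool["name"])
    for param in tool["params"]:
        p.add_argument(f"--{param}")

args = parser.parse_args(["search-files", "--query", "readme"])
print(args.tool, args.query)  # search-files readme
```

Because the parsers are rebuilt from the spec on each run, a tool added to the server simply shows up as a new subcommand, which is the "zero codegen" property described below.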
Installation and Basic Usage
# Install
pip install mcp2cli
# List available tools (~16 tokens/tool)
mcp2cli --mcp https://server.url/sse --list
# Get specific tool details (~120 tokens, only when needed)
mcp2cli --mcp https://server.url/sse search-files --help
# Use with OpenAPI spec
mcp2cli --spec api.json --base-url https://api.com list-items
# TOON format (Token-Optimized Output Notation)
mcp2cli --mcp https://server.url/sse search-files --toon
Zero Codegen: Why It Matters
mcp2cli reads specs at runtime and generates the CLI dynamically. No code generation means:
- New tools added to the MCP server appear automatically on the next invocation
- No spec files to commit or maintain
- Intelligent 1-hour TTL caching prevents unnecessary reloads
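The TTL cache behavior can be sketched as follows. This is an illustrative implementation of the idea, not mcp2cli's source; the class and method names are invented.

```python
# Minimal sketch of a TTL spec cache: reuse a fetched spec until it is
# older than the TTL (mcp2cli defaults to 1 hour), then reload.
import time

class SpecCache:
    def __init__(self, ttl: float = 3600.0):
        self.ttl = ttl
        self._store = {}  # url -> (timestamp, spec)

    def get(self, url: str, fetch):
        entry = self._store.get(url)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # still fresh: reuse cached spec
        spec = fetch(url)    # stale or missing: reload from the server
        self._store[url] = (time.monotonic(), spec)
        return spec
```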
Engineering Manager’s Perspective: Adoption Strategy
Calculating the Business Impact
Assume your team operates an AI agent with 100 MCP tools integrated.
Native MCP (1,000 conversations/day, 20 turns avg):
100 tools × 121 tokens × 20 turns × 1,000 conversations = 242,000,000 tokens/day
After mcp2cli (98% savings):
~4,840,000 tokens/day
Difference: 237,160,000 tokens/day
At Claude Sonnet 4.6 pricing ($3/MTok): ~$711/day saved, ~$21,000/month
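The back-of-envelope numbers above check out:

```python
# Verifying the daily-cost arithmetic above.
tokens_per_day = 100 * 121 * 20 * 1_000      # 242,000,000 tokens/day
after = tokens_per_day * (1 - 0.98)          # ~4,840,000 with 98% savings
saved = tokens_per_day - after               # 237,160,000 tokens/day
usd_per_day = saved / 1_000_000 * 3.0        # at $3 per million tokens
print(f"${usd_per_day:,.0f}/day, ${usd_per_day * 30:,.0f}/month")
```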
Beyond cost, keeping the context window clean directly affects model reasoning quality and latency.
Understanding the Trade-offs
mcp2cli isn’t a silver bullet. The Hacker News discussion (133 upvotes, 92 comments) surfaced key concerns:
Additional round-trips: The model needs a separate --help call the first time it uses a tool. For short tasks, this can actually increase latency.
Discovery error potential: The model might try incorrect tool names or misinterpret --help output.
Optimal use cases: 20 or more tools, conversations of 10+ turns, with most tools unused on most turns.
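A rough break-even check captures this trade-off. It uses the article's per-tool token estimates under a simplified model (tool list re-sent each turn, one --help per tool used) and deliberately ignores the latency cost of the extra round-trips:

```python
# Rough break-even check: on-demand discovery pays off once most tools go
# unused on most turns; it can lose on short tasks that touch every tool.
LIST_TOKENS, SCHEMA_TOKENS = 16, 121

def on_demand_wins(num_tools: int, turns: int, tools_used: int) -> bool:
    preload = num_tools * SCHEMA_TOKENS * turns
    on_demand = num_tools * LIST_TOKENS * turns + tools_used * SCHEMA_TOKENS
    return on_demand < preload

print(on_demand_wins(30, 15, 3))  # True: long conversation, few tools used
print(on_demand_wins(5, 1, 5))    # False: short task using every tool
```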
Adoption Roadmap
Step 1: Measure
Track actual tool schema token consumption in current AI agents
(Check system prompt token count in conversation logs)
Step 2: Pilot
Apply mcp2cli to the one agent with the most MCP tool integrations
A/B test: compare cost, accuracy, and latency
Step 3: Analyze
Identify which tools are actually used frequently
Consider hybrid: preload frequent tools, on-demand for the rest
Step 4: Scale
Roll out to all agents after validating effectiveness
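The hybrid idea from Step 3 can be sketched as a simple partition of tools by observed call frequency. The log format and threshold here are hypothetical; adapt them to whatever usage analytics your agents emit.

```python
# Sketch of the Step 3 hybrid: preload frequently used tools, leave the
# long tail to on-demand discovery. Tool names and threshold are invented.
from collections import Counter

call_log = ["search-files", "search-files", "list-items",
            "search-files", "delete-item"]  # from agent conversation logs

usage = Counter(call_log)
PRELOAD_THRESHOLD = 2  # calls per analysis window

preload = {tool for tool, n in usage.items() if n >= PRELOAD_THRESHOLD}
on_demand = set(usage) - preload
print(sorted(preload), sorted(on_demand))
```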
What the Hacker News Community Said
Reactions were mixed, which is worth understanding:
Positive responses:
- “Applying the lazy loading pattern to LLM tool discovery is elegant”
- “This could be a game changer for large-scale MCP environments”
Critical responses:
- “Token savings don’t automatically guarantee better outputs”
- “Extra round-trips for tool discovery increase latency and introduce potential for errors”
- “Benchmarks skew toward ideal scenarios”
In practice, validate against your actual workloads rather than trusting benchmarks at face value.
Production Considerations
MCP Server Type Compatibility
✅ HTTP/SSE MCP servers: Full support
✅ stdio MCP servers: Supported
✅ OpenAPI JSON/YAML: Supported
⚠️ Auth-required servers: Built-in OAuth support, requires configuration
Caching Strategy
# Default caching: 1-hour TTL
mcp2cli --mcp server.url --cache-ttl 3600 --list
# Force refresh
mcp2cli --mcp server.url --no-cache --list
Use --no-cache in development where specs change frequently; increase TTL in stable production environments.
Takeaway
The problem mcp2cli solves is simple but real. As the MCP ecosystem matures and the number of integrated servers and tools grows, schema injection costs compound: they grow with every tool and with every turn, and the two multiply together.
- 30 tools: May not justify the change
- 80+ tools: Monthly costs start to look noticeably different
- 120+ tools: This becomes a survival strategy, not just optimization
Beyond token savings, keeping the context window clean has a positive effect on actual model reasoning quality. Reducing noise in the context window is becoming as important as prompt engineering itself.