Claude Code with Local Models Triggers Full Prompt Reprocessing — An Architecture Inefficiency
Analyzing the full prompt reprocessing issue when running Claude Code with local LLMs. Learn about KV cache invalidation mechanics and developer tool design lessons.
Overview
Claude Code is Anthropic’s CLI-based AI coding assistant. While it works seamlessly through the official API, running it against local LLMs or third-party proxies triggers full prompt reprocessing on every single request, a serious inefficiency. The issue was recently reported in the r/LocalLLaMA community on Reddit, where it drew significant attention.
In this article, we analyze the technical root cause, its performance impact, and the solution.
The Core Problem: x-anthropic-billing-header
Claude Code internally embeds a billing header in the system prompt:
x-anthropic-billing-header: cc_version=2.1.39.c39; cc_entrypoint=cli; cch=56445;
The values in this header change with every request. While the official Anthropic API handles this header separately, local models and third-party proxies render it as part of the system prompt text.
KV Cache Invalidation Mechanism
In LLM inference, the KV (Key-Value) cache is essential for performance optimization. It stores computation results from previous requests and reuses them for subsequent requests with the same prefix.
graph TD
A[User Request] --> B{System Prompt<br/>Changed?}
B -->|No Change| C[KV Cache Hit<br/>Fast Response]
B -->|Changed| D[Full Reprocessing<br/>Slow Response]
D --> E[New KV Cache Generated]
C --> F[Incremental Processing Only]
Because the billing header value changes on every request, the entire system prompt is treated as modified, and the KV cache is invalidated completely. As a result, the thousands to tens of thousands of tokens in the system prompt and conversation history are reprocessed from scratch.
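A toy model makes the mechanics concrete. The sketch below is plain Python, not any real inference server's matching logic; it treats the KV cache as reuse of the longest shared token prefix between consecutive requests, which is roughly how prefix caching works in servers like llama.cpp or vLLM:

```python
def reusable_prefix_len(cached: list[str], incoming: list[str]) -> int:
    """Length of the longest shared prefix: tokens servable from the KV cache."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

def build_prompt(header_value: str) -> list[str]:
    # Toy tokenization: the billing header sits near the top of the system prompt.
    return f"<system> billing={header_value} instructions follow ...".split() + ["user", "turn"]

# Header value changes on every request: the shared prefix ends at the header token.
prev = build_prompt("cch=56445")
curr = build_prompt("cch=91273")
print(reusable_prefix_len(prev, curr))   # only tokens before the header are reusable

# With a stable header, the entire prior prompt is reusable.
print(reusable_prefix_len(build_prompt("disabled"), build_prompt("disabled")))
```

One early mismatching token is enough: everything after it must be recomputed, no matter how much of the rest is identical.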
Performance Impact Analysis
Cost on Local LLMs
The impact of KV cache invalidation is particularly severe on local model execution:
| Metric | KV Cache Active | KV Cache Invalidated |
|---|---|---|
| System Prompt Processing | First request only | Every request |
| Conversation Context Processing | Incremental only | Full reprocessing |
| VRAM Usage | Stable | Spike-and-release cycles |
| Response Latency | 0.5–2 seconds | 10–30+ seconds |
| GPU Compute Cost | Low | Very high |
Claude Code’s system prompt can reach tens of thousands of tokens. Combined with conversation history, each request ends up reprocessing hundreds of thousands of tokens. On local GPUs, this results in response times increasing by 10x or more.
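A back-of-envelope calculation shows where the latency numbers in the table come from. All figures here are illustrative assumptions (a mid-range local GPU prefilling at roughly 1,500 tokens/s), not measurements:

```python
# Back-of-envelope prefill cost with assumed, illustrative numbers.
PREFILL_TOK_PER_S = 1_500   # assumed prefill throughput of a mid-range local GPU

system_prompt = 20_000      # tokens; order of magnitude for Claude Code's system prompt
history = 15_000            # accumulated conversation tokens
new_turn = 300              # tokens actually added by this request

cold = (system_prompt + history + new_turn) / PREFILL_TOK_PER_S   # cache invalidated
warm = new_turn / PREFILL_TOK_PER_S                               # cache hit: incremental only

print(f"cache invalidated: {cold:.1f}s prefill")   # 23.5s
print(f"cache hit:         {warm:.1f}s prefill")   # 0.2s
print(f"slowdown:          {cold / warm:.0f}x")
```

With these assumptions the slowdown is two orders of magnitude, consistent with the 10x-or-worse degradation reported above.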
Cost with API Proxies
The same issue occurs when connecting to other models (e.g., GPT-4, Gemini) through third-party API proxies. When caching is invalidated on APIs that support prompt caching, token costs multiply several times over.
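The cost multiplier is easy to quantify with hypothetical rates. Many providers price cached input tokens at a fraction of the uncached rate; the 10% figure and dollar amounts below are assumptions for illustration only:

```python
# Illustrative cost impact on an API with prompt caching (hypothetical rates).
base_rate = 3.00          # $ per million input tokens (assumed)
cached_rate = 0.30        # $ per million cached input tokens (assumed: 10% of base)

prompt_tokens = 35_000    # system prompt + history resent on every request
requests = 100            # one coding session

with_cache = requests * prompt_tokens * cached_rate / 1_000_000
without_cache = requests * prompt_tokens * base_rate / 1_000_000

print(f"cache working:     ${with_cache:.2f}")
print(f"cache invalidated: ${without_cache:.2f} ({without_cache / with_cache:.0f}x)")
```

Under these assumptions a single session costs 10x more once the cache stops hitting, which matches the "token costs multiply several times over" observation.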
Solution
Environment Variable Configuration
The simplest fix is disabling the billing header. Add the following to your ~/.claude/settings.json:
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
This removes the billing header from the system prompt, allowing the KV cache to function properly.
Verifying the Fix
After applying the setting, verify the following:
- Improved response speed: Response times should decrease significantly from the second request onward
- Stable VRAM: GPU memory usage fluctuations should be reduced
- Log verification: Confirm that the billing header no longer appears in the system prompt
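For the log-verification step, one option is to scan a request body captured from your proxy or inference server logs for the header string. The payload below is a made-up stand-in, not a real captured request:

```python
import json

# Stand-in for a request body captured from your proxy's logs (hypothetical content).
captured = json.dumps({
    "system": "You are Claude Code... x-anthropic-billing-header: cc_version=2.1.39.c39; ...",
    "messages": [{"role": "user", "content": "fix the failing test"}],
})

if "x-anthropic-billing-header" in captured:
    print("billing header still present: setting not applied")
else:
    print("billing header gone: KV cache prefix should be stable")
```

The same check works as a one-line grep over whatever log format your proxy writes.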
Architecture Design Lessons
This issue goes beyond a simple bug — it provides important lessons for developer tool design.
1. Separate Metadata from Content
Billing information, telemetry data, and other metadata should be clearly separated from prompt content. HTTP headers, separate API parameters, or out-of-band channels are the correct approach.
graph LR
subgraph Wrong Design
A1[System Prompt] --> B1[Billing Header Included<br/>Changes Every Time]
end
subgraph Correct Design
A2[System Prompt] --> B2[Pure Instructions<br/>No Changes]
A3[HTTP Headers] --> B3[Billing Metadata<br/>Separate Channel]
end
2. Cache-Friendly Design
When designing LLM-based tools, prompt cache friendliness must be a priority:
- Don’t place frequently changing elements at the beginning of prompts
- Structurally separate static and dynamic content
- Minimize elements that affect cache keys
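The ordering rule above can be sketched in a few lines. This is a minimal illustration of static-first prompt assembly, assuming volatile data (here a timestamp) really must appear somewhere in the prompt:

```python
import os.path
from datetime import datetime, timezone

STATIC_SYSTEM = (            # never changes: maximizes the reusable KV cache prefix
    "You are a coding assistant.\n"
    "Follow the project's style guide.\n"
)

def assemble_prompt(history: list[str], user_msg: str) -> str:
    # Volatile data goes last, so it only invalidates the cache
    # from its own position onward, not from the top of the prompt.
    volatile = f"[request time: {datetime.now(timezone.utc).isoformat()}]"
    return STATIC_SYSTEM + "\n".join(history) + "\n" + user_msg + "\n" + volatile

p1 = assemble_prompt(["earlier turn"], "next question")
p2 = assemble_prompt(["earlier turn"], "next question")

# Everything before the volatile tail is shared between the two requests.
print(os.path.commonprefix([p1, p2]).startswith(STATIC_SYSTEM))   # True
```

Had the timestamp been placed first, the shared prefix would have been empty and every request would start cold, exactly the billing-header failure mode.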
3. Third-Party Compatibility
Even if something works fine with the official API, designs should account for third-party environment compatibility — especially for tools actively used by the open-source community.
The Bigger Picture: Future of LLM Developer Tools
This case highlights challenges facing the LLM developer tool ecosystem:
- Vendor lock-in: Tools optimized for specific APIs perform inefficiently in other environments
- Lack of transparency: Undisclosed internal architectures make debugging difficult
- Community dependence: User communities discover and share solutions themselves
Moving forward, developer tools should aim for model-agnostic design, increase transparency of internal operations, and officially support diverse execution environments.
Conclusion
The full prompt reprocessing issue in Claude Code with local models can be resolved by setting the CLAUDE_CODE_ATTRIBUTION_HEADER environment variable to "0". However, the implications extend far beyond this fix.
When developing or operating LLM-based tools, cache efficiency, metadata separation, and third-party compatibility should be considered from the earliest design stages. The fact that a single small header can dramatically alter an entire system’s performance is a powerful reminder of the importance of meticulous architecture design.