Deep-Thinking Ratio: Cut LLM Inference Costs by 50% Without Sacrificing Quality

Google & UVA research overturns the "longer = better" assumption for LLM reasoning. The Deep-Thinking Ratio (DTR) can cut inference costs in half while improving accuracy. Essential insights for Engineering Managers and VPoEs managing AI infrastructure.

The “Longer = Better” Assumption Was Wrong

For the past few years, the LLM reasoning world operated on a simple axiom: the longer your Chain-of-Thought, the more accurate your answer. This principle underpins the design of o1, o3, and Claude’s Extended Thinking — more tokens equal higher quality. It became industry orthodoxy.

In February 2026, researchers from the University of Virginia and Google published a paper that directly challenges this assumption: “Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens” (arXiv:2602.13517). Their alternative: the Deep-Thinking Ratio (DTR).

What Is the Deep-Thinking Ratio?

Core Concept: Measuring Reasoning Depth

DTR measures the proportion of tokens in a reasoning sequence where genuine deep processing occurs.

A Deep-Thinking Token is one where the model’s predictions undergo significant revision between early (shallow) layers and late (deep) layers. In other words, these are the tokens where the model actually “thinks harder” before settling on an output.

DTR = (Deep-Thinking Tokens) / (Total Reasoning Tokens)
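The paper scores depth from the model's internal layer dynamics. As a rough sketch, one could flag a token as deep-thinking when its next-token distribution shifts substantially between an early layer and the final layer; the KL-divergence measure and the 0.5 threshold below are illustrative assumptions, not the paper's exact criterion:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def deep_thinking_ratio(early_logits, late_logits, threshold=0.5):
    """Fraction of tokens whose next-token distribution shifts
    substantially between an early layer and the final layer."""
    deep = sum(
        1 for early, late in zip(early_logits, late_logits)
        if kl_divergence(softmax(late), softmax(early)) > threshold
    )
    return deep / len(late_logits)
```

A token whose early-layer and final-layer distributions agree contributes nothing to the ratio; only tokens that are heavily revised in deep layers count.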

Length vs. Depth: The Correlation Data

The research team tested 22 models including GPT-4o, Claude 3.7, Gemini 2.5 Pro, and o4-mini-high.

| Metric | Correlation with Accuracy | Interpretation |
| --- | --- | --- |
| Reasoning length (token count) | r = -0.59 | Negative: longer often means worse |
| DTR (reasoning depth ratio) | r = +0.683 | Strong positive: deeper means better |

The implication is clear: long reasoning chains are often a symptom of "overthinking" and can be inversely related to output quality.

Think@n: The DTR-Based Cost Reduction Algorithm

The research team proposes a practical algorithm called Think@n that applies DTR to eliminate wasteful computation.

How It Works

1. Begin generating n reasoning candidates in parallel
2. Generate only the first 50 tokens of each candidate
3. Calculate DTR for each 50-token prefix
4. Immediately terminate candidates with low DTR
5. Continue generating only the high-DTR candidates to completion

The key insight: just 50 tokens is enough to determine whether a reasoning path is “thinking deeply.”

Results: AIME 25 Benchmark

On the AIME 2025 benchmark (challenging math problems), Think@n delivered:

Standard Voting (baseline):
  - Accuracy: baseline
  - Cost: 100%

Think@n:
  - Accuracy: higher than baseline
  - Cost: ~51% (49% reduction)

This isn’t just a cost trade-off. Think@n achieves higher accuracy at half the cost.
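The reported numbers follow from a simple token-cost model: pay for short prefixes on every candidate, but for full generation only on the survivors. A back-of-the-envelope sketch, with all parameters illustrative rather than taken from the paper:

```python
def thinkn_cost_fraction(n, prefix_len, survivors, full_len):
    """Token cost of prefix-then-prune relative to generating n full
    reasoning traces. All parameters are illustrative assumptions."""
    pruned_cost = n * prefix_len + survivors * (full_len - prefix_len)
    baseline_cost = n * full_len
    return pruned_cost / baseline_cost
```

With 8 candidates, 50-token prefixes, 4 survivors, and 2,000-token full traces, the fraction comes out to 0.5125, in the same ballpark as the ~51% cost figure above.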

Practical Implications for Engineering Managers and VPoEs

1. Rethink Your Token Budget Policy

Most teams today operate on “more context, more tokens = better results.” DTR research shows this assumption can be fundamentally wrong.

Concrete actions to consider:

  • Differentiate task types: Identify which tasks genuinely require deep reasoning vs. which don’t. Stop applying maximum token budgets uniformly.
  • Implement early stopping logic: Build pipelines that can detect low-DTR signals and terminate reasoning early.
  • Parallel generation + filtering: Kick off multiple reasoning paths simultaneously, but cut the underperformers after 50 tokens.
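The early-stopping bullet above can be sketched as a streaming check: generate up to a checkpoint, score the prefix with whatever signal is available (DTR on open models, a confidence heuristic elsewhere), and abandon weak paths. The hook names here are illustrative, not a real API:

```python
def generate_with_early_stop(stream_tokens, score_prefix,
                             check_at=50, min_score=0.3, max_tokens=2000):
    """Stream tokens from one candidate reasoning path; at the
    checkpoint, score the prefix and abandon the path if the signal
    is weak. `stream_tokens` yields tokens; `score_prefix` maps a
    prefix to a 0-1 score. Both are placeholders for your stack."""
    tokens = []
    for token in stream_tokens():
        tokens.append(token)
        if len(tokens) == check_at and score_prefix(tokens) < min_score:
            return None  # weak reasoning signal: stop paying for this path
        if len(tokens) >= max_tokens:
            break
    return tokens
```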

2. AI Agent Pipeline Redesign

For AI agents doing complex reasoning, DTR becomes a powerful optimization lever.

```python
# Conceptual implementation: generate_tokens, calculate_dtr,
# complete_generation, and best_of are placeholders for your
# model-serving stack.
from statistics import median

def think_at_n(problem, n_candidates=5, prefix_length=50):
    candidates = []

    # Start n reasoning paths, generating only a short prefix of each
    for _ in range(n_candidates):
        prefix = generate_tokens(problem, max_tokens=prefix_length)
        dtr = calculate_dtr(prefix)
        candidates.append((prefix, dtr))

    # Filter by DTR: keep only candidates at or above the median
    threshold = median(dtr for _, dtr in candidates)
    promising = [c for c in candidates if c[1] >= threshold]

    # Spend the full generation budget only on promising candidates
    results = [complete_generation(prefix) for prefix, _ in promising]
    return best_of(results)
```

3. Expand Your Cost Monitoring Metrics

Traditional AI cost monitoring focuses on token counts and API call volumes. DTR opens up a new dimension:

| Existing Metric | DTR-Enhanced Version |
| --- | --- |
| Total token count | Deep-thinking vs. shallow-thinking token ratio |
| Response length | Quality-per-token ratio |
| API cost | Cost proportional to actual reasoning effort |

Current Limitations of DTR

Applying DTR in production today has several constraints worth acknowledging:

1. Requires access to model internals. DTR needs intermediate-layer hidden states, and commercial APIs such as GPT-4o and Claude do not expose them today.

2. Immediately applicable with open-source models. Teams deploying Llama 3.1, Qwen 3, Mistral, or other open-weight models can implement DTR-based optimization right now.

3. Awaiting vendor support. Longer term, expect Anthropic, OpenAI, and Google either to adopt DTR-based optimization at the API layer or to expose reasoning-efficiency metrics as part of their offerings.

Actionable Insights You Can Apply Today

Even without direct DTR calculation via commercial APIs, the research offers immediate practical takeaways:

Stop equating token length with reasoning quality. Simply raising max token limits is likely costing you money without improving results.

Experiment with Best-of-N strategies now. The core idea behind Think@n — start multiple paths, quickly abandon the unpromising ones — can be implemented today using other heuristics like confidence scores or perplexity in place of DTR.
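As a minimal stand-in for DTR, prefix candidates can be ranked by perplexity computed from the token log-probabilities many APIs can return; lower perplexity is a crude proxy for a confident reasoning path. This is a heuristic sketch, not the paper's Think@n criterion:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a generated prefix from its per-token
    log-probabilities; lower usually means more confident text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def prune_candidates(candidates, keep_ratio=0.5):
    """Keep the most confident prefix candidates. Each candidate is a
    (text, token_logprobs) pair; the pairing and keep_ratio are
    illustrative choices."""
    ranked = sorted(candidates, key=lambda c: perplexity(c[1]))
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```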

Diversify your reasoning approaches. For complex tasks, multiple independent short reasoning chains may outperform a single long one. Test this in your domain before assuming longer is better.

Conclusion

DTR represents a genuine paradigm shift in how we measure and optimize LLM reasoning — from “think longer” to “think deeper.” For engineering leaders managing AI infrastructure, the bottom line is compelling: there’s now a theoretical and empirical foundation for cutting inference costs in half while improving accuracy.

If your team is running open-source models, you have everything you need to start experimenting with DTR-based optimization today. And if you’re on commercial APIs, watch for vendor adoption in the coming months.



About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.