Implementing RLM (Recursive Language Models) in Coding Agents — Breaking Single Model Limits

Analyzing a real implementation of MIT's RLM paper in coding agents. Learn how recursive self-invocation overcomes context limits and boosts single model performance by 91% from an engineering perspective.

Overview

When running LLM-powered coding agents, you inevitably hit a wall: the context window limit. Whether it’s 128K or 200K tokens, working with large codebases causes the model to start missing critical information. This is known as “Context Rot” — the phenomenon where performance degrades sharply as context length increases.

MIT’s Recursive Language Models (RLM) paper (arXiv:2512.24601) presents a fundamental solution to this problem. Recently, a developer named Tenobrus implemented this idea directly in Claude Code, generating significant buzz. He implemented RLM as a “skill” inside the coding agent itself.

From the perspective of an engineering manager who builds systems with AI, let me analyze why this approach matters and how it can be applied in practice.

What is RLM?

Core Idea

The essence of RLM is simple: enabling an LLM to recursively call itself.

In the conventional approach, long prompts are fed directly into the model. Naturally, anything beyond the context window gets truncated, and even within limits, Context Rot causes the model to miss information in the middle.

RLM takes a different approach:

  1. Load prompts into an external environment: Store long inputs in an execution environment like Python REPL
  2. Programmatic exploration: The LLM writes code to peek at only the necessary parts
  3. Recursive self-invocation: Delegate subtasks to new instances of itself
  4. Result synthesis: Programmatically combine results from each recursive call

The overall flow:

graph TD
    A[Long Input 10M+ Tokens] --> B[Load into External Env<br/>Python REPL / File System]
    B --> C[Main LLM Call]
    C --> D{Parts needing analysis?}
    D -->|Chunk 1| E[Recursive Call 1]
    D -->|Chunk 2| F[Recursive Call 2]
    D -->|Chunk N| G[Recursive Call N]
    E --> H[Result Synthesis]
    F --> H
    G --> H
    H --> I[Final Response]
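The four steps above can be sketched directly in shell, the environment used later in this article. `call_model` is a hypothetical stub standing in for a real model invocation (an API or CLI call); the file paths are arbitrary examples:

```shell
#!/bin/bash
# Minimal RLM loop sketch. `call_model` is a hypothetical stand-in for a
# real LLM invocation; here it just returns a fake summary of its prompt.
call_model() {
  echo "summary(${#1} chars)"
}

# Step 1: the long input lives in the external environment (a file),
# never in any single model context
input=/tmp/rlm_input.txt
seq 100 | sed 's/^/line /' > "$input"

# Steps 2-3: programmatic exploration + recursion — split the input into
# chunks and hand each chunk to a fresh "model call" that sees only it
rm -f /tmp/rlm_chunk_*
split -l 25 "$input" /tmp/rlm_chunk_
: > /tmp/rlm_results.txt
for chunk in /tmp/rlm_chunk_*; do
  call_model "$(cat "$chunk")" >> /tmp/rlm_results.txt
done

# Step 4: result synthesis — the final call sees only the small summaries
call_model "$(cat /tmp/rlm_results.txt)"
```

The key property: no single call ever holds the full 100-line input; each sees one 25-line chunk, and the synthesis call sees only four short summaries.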

Key Results from the Paper

The MIT paper’s results are impressive:

  • 10M+ token processing: Over 100x the base context window
  • 91.3% performance improvement: vs baseline on BrowseComp+ benchmark
  • Context Rot solved: Virtually no performance degradation as input length increases
  • Cost-effective: Comparable to or cheaper than direct base model calls

Notably, RLM-Qwen3-8B, a natively recursive model they trained, showed an average 28.3% improvement over base Qwen3-8B and approached vanilla GPT-5 performance on some tasks.

Implementing RLM in Coding Agents

Tenobrus’s Experiment

Tenobrus implemented RLM as a “skill” within Claude Code. The key ideas:

  • Bash as the execution environment: Using Bash shell instead of Python REPL
  • Files as variables: Storing intermediate results in the file system
  • Implementation inside the coding agent: As a native agent capability, no separate infrastructure

This is effectively a structure where “a coding agent inside a coding agent recursively calls itself.”
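The "files as variables" idea is simple to demonstrate in plain Bash. The filenames and summary strings below are illustrative, not part of Tenobrus's actual implementation:

```shell
#!/bin/bash
# "Files as variables": intermediate results persist on disk, so no single
# shell session (or model context) has to hold them all at once.
workdir=$(mktemp -d)

# Each sub-call writes its result to a file instead of returning it in-context
echo "moduleA: exports parse(), depends on util" > "$workdir/summary_moduleA"
echo "moduleB: exports render(), depends on moduleA" > "$workdir/summary_moduleB"

# The synthesis step reads only the small summaries, not the original modules
cat "$workdir"/summary_* > "$workdir/combined"
wc -l < "$workdir/combined"
```

Because the file system outlives any single call, intermediate results can also be inspected or resumed later, which an in-context variable cannot offer.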

Why Coding Agents Need RLM

In practice, coding agent operators frequently encounter these scenarios:

  1. Large-scale refactoring: Changes spanning dozens of files
  2. Code review: Analyzing hundreds of file changes in a PR
  3. Debugging: Error causes spanning multiple modules
  4. Architecture analysis: Understanding the entire codebase structure

A single context window simply cannot handle these tasks properly. With the RLM pattern, the agent can autonomously split work, recursively process each part, and synthesize results.

Implementation Architecture

The basic structure for implementing RLM in a coding agent:

graph TB
    subgraph "Coding Agent"
        A[Receive Task] --> B[Scope Analysis]
        B --> C{Fits in single<br/>context?}
        C -->|Yes| D[Direct Processing]
        C -->|No| E[Apply RLM Pattern]
        E --> F[Task Decomposition]
        F --> G[Subtask 1<br/>Recursive Call]
        F --> H[Subtask 2<br/>Recursive Call]
        F --> I[Subtask N<br/>Recursive Call]
        G --> J[Intermediate Results<br/>File Storage]
        H --> J
        I --> J
        J --> K[Result Synthesis]
        K --> L[Final Output]
        D --> L
    end

Three key points:

  1. Automatic decomposition: The LLM autonomously splits work into appropriate sizes
  2. External storage utilization: Using the file system as “memory” to bypass context window limits
  3. Programmatic synthesis: Intelligent result merging through code, not simple concatenation
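The "fits in a single context?" branch from the diagram can be approximated with a crude size check. The 4-characters-per-token heuristic and the 128K budget below are illustrative assumptions, not real tokenizer behavior:

```shell
#!/bin/bash
# Route a task to direct processing or the RLM pattern based on input size.
# ~4 characters per token is a rough heuristic, not an exact tokenizer.
CONTEXT_BUDGET_TOKENS=128000

estimate_tokens() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( bytes / 4 ))
}

route_task() {
  if [ "$(estimate_tokens "$1")" -le "$CONTEXT_BUDGET_TOKENS" ]; then
    echo "direct"   # small enough: a single model call
  else
    echo "rlm"      # too big: decompose, recurse, synthesize
  fi
}

head -c 1000 /dev/zero > /tmp/small_task.txt     # ~250 "tokens"
head -c 600000 /dev/zero > /tmp/big_task.txt     # ~150K "tokens"
route_task /tmp/small_task.txt
route_task /tmp/big_task.txt
```

In practice you would leave headroom below the real window for the system prompt, tool definitions, and the response itself.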

Single Model Limits and Where RLM Fits

Multi-Agent vs RLM

The AI industry has recently focused on multi-agent systems to overcome single model limits — multiple models collaborating on a task.

RLM is a different approach. Because the same model recursively calls itself, there’s no inter-model communication overhead, and it maintains a consistent “thought process.”

| Comparison | Multi-Agent | RLM |
| --- | --- | --- |
| Model diversity | Multiple models combinable | Single model |
| Communication overhead | High | Low |
| Consistency | Inter-model differences | Consistent (same model) |
| Context expansion | Distributed processing | Recursive splitting |
| Implementation complexity | High | Relatively low |

Hybrid Approach in Practice

From an EM perspective managing teams, RLM and multi-agent are not either/or but complementary.

  • RLM: Efficiently handling large contexts within a single agent
  • Multi-agent: Collaboration between agents with different specializations

The most effective architecture in practice has each agent in a multi-agent system use RLM patterns internally.

graph TD
    subgraph "Multi-Agent System"
        M[Orchestrator] --> A1[Coding Agent<br/>+ RLM]
        M --> A2[Review Agent<br/>+ RLM]
        M --> A3[Test Agent<br/>+ RLM]
    end
    
    subgraph "Inside Each Agent"
        R1[Task] --> R2[Recursive Decomposition]
        R2 --> R3[Sub-call 1]
        R2 --> R4[Sub-call 2]
        R3 --> R5[Synthesis]
        R4 --> R5
    end
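The hybrid shape reduces to function composition: an orchestrator fans work out to specialized agents, each of which applies the same RLM core internally. All the agents below are stubs standing in for real model calls:

```shell
#!/bin/bash
# Hybrid sketch: orchestrator -> specialized agents -> shared RLM core.
# Every function here is a stub; a real system would make model calls.

rlm_process() {   # shared RLM core: chunk, handle each chunk, combine
  local role="$1" input="$2" out=""
  for chunk in $input; do         # word-split "chunks" stand in for real ones
    out="$out[$role:$chunk]"
  done
  echo "$out"
}

coding_agent() { rlm_process "code"   "$1"; }
review_agent() { rlm_process "review" "$1"; }
test_agent()   { rlm_process "test"   "$1"; }

orchestrator() {
  coding_agent "$1"
  review_agent "$1"
  test_agent   "$1"
}

orchestrator "auth payments billing"
```

The orchestrator only routes between specializations; the context-scaling work happens inside each agent's `rlm_process` loop.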

Practical Application: What You Can Try Now

1. RLM Skills in Claude Code

Following Tenobrus's approach, a pattern like the following implements the core loop. This is a sketch against the `claude` CLI's headless print mode (`claude -p "<prompt>"`, which reads additional input from stdin):

# Example: large codebase analysis
# Step 1: Save the file list to the external environment (the file system)
find src/ -name "*.ts" > /tmp/rlm_files.txt

# Step 2: Recursively analyze each file in its own sub-call;
# each sub-call sees only one file, never the whole codebase
while read -r file; do
  analysis=$(claude -p "Summarize the key interfaces and dependencies of this file" < "$file")
  echo "## $file" >> /tmp/rlm_summaries.txt
  echo "$analysis" >> /tmp/rlm_summaries.txt
done < /tmp/rlm_files.txt

# Step 3: Synthesize the per-file summaries to understand the overall structure
claude -p "Analyze the overall architecture based on these summaries" < /tmp/rlm_summaries.txt
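One caveat worth adding to this pattern: once an agent can invoke itself, unbounded recursion becomes possible, so a depth cap is cheap insurance. The depth-passing convention and `call_model` stub below are illustrative, not part of any real CLI:

```shell
#!/bin/bash
# Guarding recursive self-invocation with a depth limit. Each sub-call
# receives depth+1 as its second argument, so recursion always terminates.
# In a real agent, depth would be passed to child processes via an env var.
MAX_DEPTH=3

call_model() {  # hypothetical stand-in for a real model invocation
  echo "processed: $1"
}

rlm_call() {
  local depth="${2:-0}"
  if [ "$depth" -ge "$MAX_DEPTH" ]; then
    echo "depth limit reached; answering directly" >&2
    call_model "$1"
    return
  fi
  # Below the limit, the "model" chooses to decompose and recurse
  rlm_call "$1 (decomposed)" "$((depth + 1))"
}

rlm_call "analyze repo"
```

Without the cap, a model that always chooses to decompose would recurse forever; with it, the worst case is a bounded tree of sub-calls.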

2. Phased Adoption Strategy

When introducing RLM patterns in an organization, I recommend these phases:

  1. Phase 1: Apply to code review automation first (low risk)
  2. Phase 2: Extend to large-scale refactoring assistance
  3. Phase 3: Integrate into debugging workflows
  4. Phase 4: Apply across the entire development pipeline

Paper Implications: Future Outlook

The RLM paper’s implications are clear:

  1. Current LLMs are underestimated: With proper software infrastructure, performance improves dramatically
  2. The context window expansion race can be bypassed: Software-based recursion is more efficient than hardware expansion
  3. Native RLM training is the next step: Training recursion natively, like RLM-Qwen3-8B, yields even greater results
  4. Coding agents are the first beneficiaries: The file system already exists as a natural external environment

Conclusion

RLM is not merely an academic idea. As Tenobrus’s experiment demonstrates, it’s a practical pattern you can implement in coding agents right now.

If you’re feeling the limits of single models, try the RLM pattern before building a multi-agent system. You’ll be surprised by how much more you can achieve with the same model.

From the perspective of building systems with AI, RLM embodies the essence of engineering — “boosting performance through architecture without changing the model.” Not waiting for bigger models, but building smarter structures with what we have — that’s the direction engineering managers should be watching.


About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.