Does AGENTS.md Actually Work? The First Empirical Study Reveals Surprising Results
The first empirical study evaluating AGENTS.md effectiveness has been published. We analyze its impact on coding agent success rates and inference costs.
Overview
As coding agents like Cursor, Claude Code, and Codex continue to spread, the practice of placing AGENTS.md (or CLAUDE.md, CURSOR.md) files in repositories to provide project context to agents has rapidly gained traction. Currently, over 60,000 repositories on GitHub alone include such files.
But does this file actually improve agent task success rates? A research team from ETH Zürich has published the first empirical answer to this question.
📄 Paper: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (arXiv 2602.11988, February 2026)
Key Findings: Results That Defy Expectations
LLM-Generated Context Files Actually Reduce Success Rates
The research team evaluated coding agents in three settings:
- No context file (baseline)
- LLM-generated context file (agent developer recommended approach)
- Developer-written context file
```
┌──────────────────────────────────────────────────┐
│ Average Success Rate Change by Setting           │
├──────────────────────────────────────────────────┤
│ No context (baseline)     : ████████    base     │
│ LLM-generated context     : ██████▌     -3%      │
│ Developer-written context : ████████▌   +4%      │
└──────────────────────────────────────────────────┘
```
Key numbers at a glance:
- LLM-generated files: Average success rate decreased by 3%
- Developer-written files: Average success rate increased by 4% (marginal improvement)
- Inference cost: Increased by over 20% in both cases
Why Did This Happen?
The research team conducted a detailed analysis of agent behavior patterns:
```mermaid
graph TD
    A[AGENTS.md Provided] --> B[Broader Exploration]
    B --> C[More File Traversal]
    B --> D[More Test Execution]
    B --> E[Longer Reasoning Chains]
    C --> F[Attempting to Follow Unnecessary Requirements]
    D --> F
    E --> F
    F --> G[Increased Cost + Reduced Success Rate]
```
Agents tended to follow the instructions in context files faithfully. The problem was that many of those instructions were unnecessary for the given task. Directives such as following style guides or using specific test patterns actually made tasks more complex.
AGENTbench: A New Benchmark
The research team built a new benchmark called AGENTbench for this evaluation.
| Item | Details |
|---|---|
| Instances | 138 |
| Target Repositories | 12 (repos where developers actively use context files) |
| Task Types | Bug fixes + Feature additions |
| Complementary Benchmark | SWE-bench Lite (for popular repositories) |
Existing SWE-bench focused on well-known large repositories that didn’t include AGENTS.md. AGENTbench is the first benchmark to collect tasks from repositories that actually use context files.
Practical Implications: How Should You Use AGENTS.md?
❌ What NOT to Do
- Let LLMs auto-generate AGENTS.md via `/init` commands
- Cram all project rules, style guides, and architecture descriptions into one file
- Provide extensive context expecting the agent to “read it all”
✅ What TO Do
The research team’s recommendation is clear: “Describe only minimal requirements”
Principles for effective AGENTS.md:
- Specify only build/test commands (e.g., `npm test`, `pytest`)
- Document only project-specific tooling
- Keep style guides and architecture descriptions in separate documents
- Include only information directly needed for the agent’s task
```markdown
# Good AGENTS.md Example

## Build
npm install && npm run build

## Test
npm test                      # All tests
npm test -- --grep "pattern"  # Specific tests

## Lint
npm run lint                  # Required before commit
```

```markdown
# Bad AGENTS.md Example (excessive requirements)

## Architecture
This project follows clean architecture...
(verbose explanation, 200 lines)

## Coding Style
All functions must include JSDoc comments...
Variable names must use camelCase...
(detailed rules, 100 lines)

## Commit Rules
Follow Conventional Commits...
```
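The "keep it minimal" guidance above can be turned into a quick sanity check on your own repository. The sketch below is illustrative, not tooling from the paper: the 40-line budget and the list of "risky" section headings are assumptions you should tune to your project.

```python
# Sketch: flag an AGENTS.md that may be doing too much.
# MAX_LINES and RISKY_HEADINGS are illustrative assumptions,
# not thresholds from the paper.
from pathlib import Path

MAX_LINES = 40  # assumed budget; the paper only says "minimal", not a number
RISKY_HEADINGS = ("architecture", "coding style", "commit")

def check_agents_md(path: str = "AGENTS.md") -> list[str]:
    """Return a list of warnings about a context file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    warnings = []
    if len(lines) > MAX_LINES:
        warnings.append(
            f"{path} has {len(lines)} lines (> {MAX_LINES}); consider trimming"
        )
    for line in lines:
        if line.lstrip().startswith("#"):
            heading = line.lstrip("# ").lower()
            if any(key in heading for key in RISKY_HEADINGS):
                warnings.append(f"section '{heading}' may belong in separate docs")
    return warnings
```

Running this in CI (or as a pre-commit hook) is one lightweight way to keep a context file from drifting back toward the "bad example" above.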
Developer Community Response
The paper received 58 points on Hacker News and sparked active discussion. Key reactions include:
- “Intuitively correct results”: Experiential agreement that excessive instructions confuse agents
- “Context window waste”: Concerns that long AGENTS.md files displace actual code context
- “Minimal is best”: Shared practical experience that build/test commands alone are sufficient
Limitations and Future Outlook
This study has several limitations:
- Python-centric: AGENTbench only covers Python projects
- Niche repositories: Repositories using context files are relatively small-scale
- Static evaluation: Whether context files have cumulative effects in repeated tasks is untested
Future research directions include:
- Adaptive context: Dynamically providing only necessary information based on task type
- Structured context: Using machine-parseable formats instead of free text
- Multi-language expansion: Verifying effectiveness beyond Python
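As a purely hypothetical illustration of the "structured context" direction (the paper proposes the idea, not a format, and no agent consumes this schema today), a machine-parseable context file might look like:

```yaml
# Hypothetical structured context — field names are invented for illustration.
build:
  command: npm install && npm run build
test:
  all: npm test
  single: npm test -- --grep "<pattern>"
lint:
  command: npm run lint
  required_before_commit: true
# Pointers instead of inlined prose: style and architecture docs stay
# elsewhere, to be fetched only when a task actually needs them.
docs:
  style_guide: docs/STYLE.md
  architecture: docs/ARCHITECTURE.md
```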
Conclusion
AGENTS.md is becoming a de facto standard in the coding agent ecosystem, but this paper has challenged the assumption that “more is better”.
The core message is simple:
Keep context files minimal, focused on build and test commands.
Auto-generating context files via `/init`, as agent developers recommend, may actually backfire at this point. Writing them manually with only essential information is the most effective strategy.
References
- Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? arXiv 2602.11988, February 2026.