AI Self-Generated Skills Are Useless — Research Debunking the LLM Self-Improvement Myth
SkillsBench finds that AI agents cannot author useful skills for themselves: across 7,308 trajectories, self-generated skills showed zero average benefit, while human-curated skills improved performance by 16.2pp.
Overview
“AI that makes itself better” — self-play and self-improvement paradigms are among the most compelling narratives in the AI industry. But a new study, SkillsBench (arXiv:2602.12670), directly challenges this myth.
Across 11 domains, 86 tasks, 7 agent-model configurations, and 7,308 trajectories, the large-scale experiment found:
- Human-curated skills: average +16.2pp performance improvement
- AI self-generated skills: zero benefit (0pp)
In other words, LLMs cannot reliably author the procedural knowledge they benefit from consuming.
What Are Agent Skills?
Agent Skills as defined in the research are structured packages of procedural knowledge injected at inference time into LLM agents.
Skills Package Structure
```
├── SKILL.md     # Procedural guide (workflows, SOPs)
├── scripts/     # Executable scripts
├── templates/   # Code templates
└── examples/    # Reference examples
```
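Injection at inference time amounts to reading the package and prepending it to the agent's context. Here is a minimal sketch of such a loader — the `load_skill` helper and its concatenation format are illustrative assumptions, not the paper's or any vendor's actual implementation:

```python
from pathlib import Path

def load_skill(skill_dir: str) -> str:
    """Assemble a context block from a skills package directory.

    Concatenates SKILL.md with any templates/ and examples/ files so the
    result can be prepended to an agent's prompt at inference time.
    """
    root = Path(skill_dir)
    parts = [(root / "SKILL.md").read_text()]
    for sub in ("templates", "examples"):
        folder = root / sub
        if folder.is_dir():
            for f in sorted(folder.iterdir()):
                parts.append(f"--- {sub}/{f.name} ---\n{f.read_text()}")
    return "\n\n".join(parts)
```

Note that `scripts/` is deliberately excluded here: executable resources are invoked by the agent's tools rather than pasted into the prompt.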
Key differences from existing approaches:
| Property | System Prompt | RAG | Few-shot | Skills |
|---|---|---|---|---|
| Structured | ✗ | ✗ | ✗ | ✓ |
| Procedural | △ | ✗ | ✗ | ✓ |
| Executable resources | ✗ | ✗ | ✗ | ✓ |
| Portable | ✗ | △ | △ | ✓ |

(△ = partial support)
Modern agent tools like Claude Code’s CLAUDE.md, Gemini CLI, and Codex CLI have adopted this Skills concept.
Experimental Design: 3-Condition Comparison
SkillsBench evaluates identical tasks under three conditions:
```mermaid
graph LR
    A[Same Task] --> B[No Skills<br/>Baseline]
    A --> C[Curated Skills<br/>Human-Curated]
    A --> D[Self-Generated Skills<br/>AI-Generated]
    B --> E[Compare Results]
    C --> E
    D --> E
```
Experimental scale:
- 11 domains (software engineering, data analysis, healthcare, etc.)
- 86 tasks (selected from 322 candidates by 105 contributors)
- 7 agent-model configurations (Claude Code, Gemini CLI, Codex CLI)
- 7,308 trajectories (exhaustive evaluation)
All evaluations use deterministic verifiers for pass/fail judgment, eliminating LLM-as-judge bias.
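A deterministic verifier in this sense is just a plain function that inspects the agent's output artifact and returns pass/fail, with no model in the judging loop. A hypothetical example — the file format and required keys are invented for illustration, not taken from SkillsBench:

```python
import json
from pathlib import Path

def verify_task(output_path: str) -> bool:
    """Deterministic pass/fail check on an agent's output artifact.

    No LLM-as-judge: the task passes only if the agent produced valid
    JSON containing the required keys with a non-negative total.
    """
    p = Path(output_path)
    if not p.exists():
        return False
    try:
        data = json.loads(p.read_text())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and {"items", "total"} <= data.keys()
        and isinstance(data["total"], (int, float))
        and data["total"] >= 0
    )
```

Because the same check runs identically on every trajectory, pass rates across the three conditions are directly comparable.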
Key Finding 1: Curated Skills Are Effective
Human-curated skills showed an average +16.2pp performance improvement. However, domain variance is extreme:
| Domain | Performance Gain |
|---|---|
| Healthcare | +51.9pp |
| Data Analysis | High improvement |
| Software Engineering | +4.5pp |
| Some tasks (16/84) | Negative |
Notably, 16 out of 84 tasks showed performance degradation with skills. Skills are not a silver bullet.
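Percentage-point (pp) gains like these are simply differences in pass rates between conditions, averaged over tasks. A sketch of that aggregation — the task names and rates below are invented, not the paper's data:

```python
def pp_gain(baseline: dict[str, float], skilled: dict[str, float]) -> tuple[float, list[str]]:
    """Mean percentage-point gain of the skilled condition over baseline,
    plus the list of tasks where skills reduced the pass rate."""
    deltas = {task: (skilled[task] - baseline[task]) * 100 for task in baseline}
    harmed = sorted(task for task, d in deltas.items() if d < 0)
    return sum(deltas.values()) / len(deltas), harmed
```

The `harmed` list is what surfaces cases like the 16 regressing tasks: a positive average can hide individual tasks that skills make worse.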
Key Finding 2: Self-Generated Skills Are Useless
This is the most shocking result of the study.
When LLMs were asked to “write skills for yourself to better perform this task,” then used those skills:
“Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming.”
The average effect of self-generated skills was 0pp. In some cases, they were actively harmful.
```mermaid
graph TD
    subgraph "Self-Improvement Myth"
        A[LLM Generates Skills] --> B[Apply Generated Skills]
        B --> C[Performance Improvement?]
        C -->|Actual Result| D[No Effect ❌]
    end
    subgraph "What Actually Works"
        E[Humans Curate Skills] --> F[Apply Curated Skills]
        F --> G[Average +16.2pp ✓]
    end
```
This is a strong counterargument to the assumption that self-play and self-improvement generalize universally. Models are proficient at consuming externally provided procedural knowledge but lack the ability to produce useful procedural knowledge of their own.
Key Finding 3: Less Is More
Another important discovery concerns skill size:
Focused skills with 2–3 modules outperform comprehensive documentation
Small, focused skill packages boost performance more than extensive manuals. This likely relates to LLMs’ context window utilization efficiency.
Additionally, smaller models + skills ≈ larger models (without skills). A small model armed with proper skills can match the baseline performance of a larger model.
Practical Implications
The message for practitioners using AI agents is clear:
1. Reconsider Skill Auto-Generation Pipelines
The approach of “AI generates and improves its own skills” is currently ineffective. Human expert curation remains essential.
2. Keep Skills Small and Focused
Core skills with 2–3 modules are more effective than massive documentation. Writing a concise CLAUDE.md focused on key workflows beats hundreds of lines.
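As a hypothetical illustration of this advice — the project, commands, and paths below are invented — a focused CLAUDE.md might look like:

```markdown
# CLAUDE.md

## Build & test
- Build with `make build`; run `make test` before every commit.

## Conventions
- API handlers live in `internal/api/`, one file per resource.
- Wrap errors with context; never swallow them silently.

## Review checklist
1. New endpoints need a table-driven test.
2. Update `docs/api.md` whenever a route changes.
```

A dozen actionable lines like these leave context-window room for the task itself, which hundreds of lines of exhaustive documentation would not.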
3. Recognize Domain-Specific Variance
The gap between healthcare (+51.9pp) and software engineering (+4.5pp) is over 10x. Skills have diminishing returns in domains where models already excel.
4. Acknowledge That Skills Can Be Harmful
Skills degraded performance in 16 out of 84 tasks. Bad skills are worse than no skills at all.
Technical Analysis: Why Self-Generation Fails
While the paper doesn’t provide direct causal analysis, we can infer structural reasons:
Metacognition limitations: LLMs cannot accurately assess “what they don’t know.” They lack the ability to diagnose which procedural knowledge they need.
General knowledge vs. procedural knowledge: LLM pre-training data is biased toward declarative knowledge. They learn “what” better than “how-to.”
Unverifiability: Models have no way to verify the quality of their self-generated skills. Curated skills undergo validation by human experts.
Conclusion
SkillsBench is the first systematic benchmark for AI agent skills, presenting hard data against the attractive narrative of “AI self-improvement.”
The core message is simple:
- ✅ Human-created skills are effective (+16.2pp)
- ❌ AI-created skills have no effect (0pp)
- ✅ Small, focused skills outperform extensive documentation
- ✅ Small model + good skills ≈ large model
The dream of self-improvement is compelling, but current LLMs have not yet reached that level. Human domain expertise and curation remain irreplaceable.
References
- SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks — Xiangyi Li et al., 2026
- Anthropic Claude Code Skills Documentation
- Harbor Framework — Agent benchmark framework