AI Self-Generated Skills Are Useless — Research Debunking the LLM Self-Improvement Myth

SkillsBench finds that AI agents cannot author useful skills for themselves. Across 7,308 trajectories, self-generated skills showed zero average benefit, while human-curated skills improved performance by 16.2pp.

Overview

“AI that makes itself better” — self-play and self-improvement paradigms are among the most compelling narratives in the AI industry. But a new study, SkillsBench (arXiv:2602.12670), directly challenges this myth.

Across 11 domains, 86 tasks, 7 agent-model configurations, and 7,308 trajectories, the large-scale experiment found:

  • Human-curated skills: average +16.2pp performance improvement
  • AI self-generated skills: zero benefit (0pp)

In other words, LLMs cannot reliably author the procedural knowledge they benefit from consuming.

What Are Agent Skills?

Agent Skills, as defined in the paper, are structured packages of procedural knowledge injected into an LLM agent's context at inference time.

Skills Package Structure
├── SKILL.md          # Procedural guide (workflows, SOPs)
├── scripts/          # Executable scripts
├── templates/        # Code templates
└── examples/         # Reference examples
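At inference time, a package with this layout is read and assembled into the agent's context. A minimal sketch of that loading step, assuming the directory tree above (the function and variable names here are illustrative, not from the paper or any specific tool):

```python
from pathlib import Path

def load_skill(skill_dir: str) -> str:
    """Assemble an inference-time context block from a skill package."""
    root = Path(skill_dir)
    # SKILL.md carries the procedural guide (workflows, SOPs).
    guide = (root / "SKILL.md").read_text(encoding="utf-8")
    # List bundled resources so the agent knows what it can run or copy.
    resources = []
    for sub in ("scripts", "templates", "examples"):
        sub_dir = root / sub
        if sub_dir.is_dir():
            resources += [p.relative_to(root).as_posix()
                          for p in sorted(sub_dir.iterdir())]
    listing = "\n".join(f"- {r}" for r in resources) or "- (none)"
    return f"{guide}\n\nAvailable resources:\n{listing}"
```

The resulting string would typically be prepended to the system prompt or task instructions before the agent starts working.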

Key differences from existing approaches: compared with system prompts, RAG, and few-shot examples, Skills are distinguished along four dimensions:

  • Structured — packaged in a fixed layout rather than free-form text
  • Procedural — encode workflows and SOPs, not just facts
  • Executable resources — bundle scripts and templates the agent can run
  • Portable — move between tasks, agents, and models as a unit

Modern agent tools such as Claude Code (with its CLAUDE.md convention), Gemini CLI, and Codex CLI have adopted variants of this Skills concept.

Experimental Design: 3-Condition Comparison

SkillsBench evaluates identical tasks under three conditions:

graph LR
    A[Same Task] --> B[No Skills<br/>Baseline]
    A --> C[Curated Skills<br/>Human-Curated]
    A --> D[Self-Generated Skills<br/>AI-Generated]
    B --> E[Compare Results]
    C --> E
    D --> E
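In code form, the comparison reduces to running the same tasks under three skill sources and comparing pass rates. A sketch, assuming placeholder interfaces (`agent_run`, the task objects, and the skill sources are my own names, not SkillsBench's actual API):

```python
# Three-condition evaluation loop: baseline vs curated vs self-generated.
def evaluate(tasks, agent_run, curated_skills, generate_skills):
    conditions = {
        "baseline": lambda task: None,                    # no skills
        "curated": lambda task: curated_skills[task.id],  # human-curated
        "self_generated": generate_skills,                # model-authored
    }
    pass_rates = {}
    for name, get_skill in conditions.items():
        # agent_run returns True/False (pass/fail) per trajectory.
        verdicts = [agent_run(task, skill=get_skill(task)) for task in tasks]
        pass_rates[name] = sum(verdicts) / len(verdicts)
    return pass_rates
```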

Experimental scale:

  • 11 domains (software engineering, data analysis, healthcare, etc.)
  • 86 tasks (selected from 322 candidates by 105 contributors)
  • 7 agent-model configurations (Claude Code, Gemini CLI, Codex CLI)
  • 7,308 trajectories (exhaustive evaluation)

All evaluations use deterministic verifiers for pass/fail judgment, eliminating LLM-as-judge bias.
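A deterministic verifier in this sense is just a reproducible pass/fail function over the agent's output artifacts. A file-comparison sketch (the real SkillsBench verifiers are task-specific; this one is purely illustrative):

```python
import json
import os

def verify_trajectory(workdir: str, expected: dict) -> bool:
    """Deterministic pass/fail check on an agent's output file."""
    try:
        with open(os.path.join(workdir, "output.json")) as f:
            produced = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False  # missing or malformed output fails deterministically
    # Exact comparison: the same trajectory always gets the same verdict,
    # unlike an LLM judge whose scoring can drift between runs.
    return produced == expected
```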

Key Finding 1: Curated Skills Are Effective

Human-curated skills showed an average +16.2pp performance improvement. However, domain variance is extreme:

| Domain | Performance Gain |
| --- | --- |
| Healthcare | +51.9pp |
| Data Analysis | High improvement |
| Software Engineering | +4.5pp |
| Some tasks (16/84) | Negative |

Notably, 16 out of 84 tasks showed performance degradation with skills. Skills are not a silver bullet.
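For clarity, "pp" means percentage points: the raw difference between two pass rates, not a relative change. A trivial helper makes the arithmetic explicit (the illustrative rates below are my own, not figures from the paper):

```python
def pp_improvement(pass_with_skills: float, pass_baseline: float) -> float:
    """Percentage-point (pp) gain: a plain difference of pass rates."""
    return round((pass_with_skills - pass_baseline) * 100, 1)

# e.g. a 0.700 pass rate against a 0.538 baseline is a +16.2pp gain
```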

Key Finding 2: Self-Generated Skills Are Useless

This is the most shocking result of the study.

When LLMs were asked to “write skills for yourself to better perform this task,” then used those skills:

“Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming.”

The average effect of self-generated skills was 0pp. In some cases, they were actively harmful.

graph TD
    subgraph "Self-Improvement Myth"
        A[LLM Generates Skills] --> B[Apply Generated Skills]
        B --> C[Performance Improvement?]
        C -->|Actual Result| D[No Effect ❌]
    end
    subgraph "What Actually Works"
        E[Humans Curate Skills] --> F[Apply Curated Skills]
        F --> G[Average +16.2pp ✓]
    end

This is a powerful counterargument to the assumed universality of self-play and self-improvement. Models are proficient at consuming externally provided procedural knowledge but cannot yet reliably produce it.

Key Finding 3: Less Is More

Another important discovery concerns skill size:

Focused skills with 2–3 modules outperform comprehensive documentation

Small, focused skill packages boost performance more than extensive manuals. This likely reflects how efficiently LLMs use their context window: a short, targeted skill leaves more room and attention for the task itself.

Additionally, smaller models + skills ≈ larger models (without skills). A small model armed with proper skills can match the baseline performance of a larger model.

Practical Implications

The message for practitioners using AI agents is clear:

1. Reconsider Skill Auto-Generation Pipelines

The approach of “AI generates and improves its own skills” is currently ineffective. Human expert curation remains essential.

2. Keep Skills Small and Focused

Core skills with 2–3 modules are more effective than massive documentation. A concise CLAUDE.md focused on key workflows beats hundreds of lines of exhaustive documentation.

3. Recognize Domain-Specific Variance

The gap between healthcare (+51.9pp) and software engineering (+4.5pp) is over 10x. Skills have diminishing returns in domains where models already excel.

4. Acknowledge That Skills Can Be Harmful

Skills degraded performance in 16 out of 84 tasks. Bad skills are worse than no skills at all.

Technical Analysis: Why Self-Generation Fails

While the paper doesn’t provide direct causal analysis, we can infer structural reasons:

Metacognition limitations: LLMs cannot accurately assess “what they don’t know.” They lack the ability to diagnose which procedural knowledge they need.

General knowledge vs. procedural knowledge: LLM pre-training data is biased toward declarative knowledge. They learn “what” better than “how-to.”

Unverifiability: Models have no way to verify the quality of their self-generated skills. Curated skills undergo validation by human experts.

Conclusion

SkillsBench is the first systematic benchmark for AI agent skills, and it sets hard data against the attractive narrative of "AI self-improvement."

The core message is simple:

  • ✅ Human-created skills are effective (+16.2pp)
  • ❌ AI-created skills have no effect (0pp)
  • ✅ Small, focused skills outperform extensive documentation
  • ✅ Small model + good skills ≈ large model

The dream of self-improvement is compelling, but current LLMs have not yet reached that level. Human domain expertise and curation remain irreplaceable.


About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.