Applying Verbalized Sampling to Claude Code Agents: 1.6〜2.1x LLM Diversity Boost

A practical guide to applying the Verbalized Sampling technique to Claude Code agents, achieving 2.0x prompt diversity, 1.8x content diversity, and 1.6x writing style diversity. Covers the modification histories of 4 agents, parameter tuning, and cost analysis.

Overview

Large Language Models (LLMs) demonstrate impressive capabilities, but the alignment process introduces mode collapse: models tend to generate only safe, predictable responses, which reduces creativity and diversity.

This article shares practical experience applying Verbalized Sampling, a technique proposed by Stanford researchers, to a Claude Code agent system, achieving 1.6〜2.1x improvements in output diversity.

Key Achievements

  • 4 Agents Modified: prompt-engineer, content-planner, writing-assistant, image-generator
  • 540 Lines Added: 8 new sections, 12 practical examples
  • Quantitative Results:
    • Prompt diversity: 2.0x ↑
    • Content topic diversity: 1.8x ↑
    • Writing style diversity: 1.6x ↑
    • Image prompt diversity: 1.5x ↑

Problem Definition: Mode Collapse

LLM Typicality Bias

Aligned LLMs converge to patterns like these:

Query: "Suggest 5 web development trend topics"

Typical Response:
1. Frontend Trends to Watch in 2025
2. React Guide for Beginners
3. TypeScript vs JavaScript Comparison
4. Becoming a Full-Stack Developer
5. Performance Optimization Best Practices

These topics are safe and validated, but lack originality. Hundreds of blogs already cover these subjects.

Why Does Mode Collapse Occur?

graph TD
    A[Pre-training] -->|High-Quality Data| B[Base Model]
    B -->|RLHF/RLAIF| C[Aligned Model]
    C -->|Safety Constraints| D[Mode Collapse]
    D -->|Result| E[Only Typical Responses]
    E -->|Problem| F[Reduced Creativity]
    E -->|Problem| G[Lack of Diversity]

    style D fill:#ff6b6b
    style F fill:#ffd93d
    style G fill:#ffd93d

  1. Pre-training: Models learn diverse patterns from vast data
  2. Alignment: RLHF (Reinforcement Learning from Human Feedback) teaches safe, helpful responses
  3. Mode Collapse: Overemphasis on safe responses leads to diversity loss

Verbalized Sampling Principles

Core Idea

Verbalized Sampling asks the LLM to state several candidate responses explicitly, each with an estimated probability, and then samples from the low-probability region of that distribution.

graph LR
    A[User Request] --> B[Generate k Responses]
    B --> C[Assign Probabilities]
    C --> D{tau Threshold Filter}
    D -->|p < tau| E[Sample from Tail]
    D -->|p >= tau| F[Exclude]
    E --> G[Select Final Response]

    style E fill:#6bcf7f
    style F fill:#ff6b6b

Prompt Template

<instructions>
Generate k=5 diverse responses for the following topic.

Wrap each response in <response> tags with:
- <text>: Actual content
- <probability>: Selection probability (set below tau=0.10)

Sample from the tail of the distribution to discover non-typical but valuable options.
</instructions>

Topic: [user request]
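To make the template concrete, here is a minimal Python sketch that fills it in and parses the tagged reply. The function names `build_vs_prompt` and `parse_vs_responses` are hypothetical helpers, not part of Claude Code, and the parser assumes the model follows the `<response>` / `<text>` / `<probability>` layout exactly.

```python
import re

VS_TEMPLATE = """<instructions>
Generate k={k} diverse responses for the following topic.

Wrap each response in <response> tags with:
- <text>: Actual content
- <probability>: Selection probability (set below tau={tau:.2f})

Sample from the tail of the distribution to discover non-typical but valuable options.
</instructions>

Topic: {topic}"""

# Matches one <response> block and captures its text and probability.
RESPONSE_RE = re.compile(
    r"<response>\s*<text>(.*?)</text>\s*"
    r"<probability>\s*([0-9]*\.?[0-9]+)\s*</probability>\s*</response>",
    re.DOTALL,
)


def build_vs_prompt(topic: str, k: int = 5, tau: float = 0.10) -> str:
    """Fill the template with the user request and the sampling parameters."""
    return VS_TEMPLATE.format(k=k, tau=tau, topic=topic)


def parse_vs_responses(raw: str) -> list[tuple[str, float]]:
    """Extract (text, probability) pairs from the model's reply."""
    return [(text.strip(), float(prob)) for text, prob in RESPONSE_RE.findall(raw)]
```

A reply that deviates from the tag layout simply yields fewer pairs, so a caller may want to check that `len(parse_vs_responses(reply))` equals k before continuing.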

Key Parameters

| Parameter | Default | Description | Recommended Range |
|---|---|---|---|
| k | 5 | Number of candidate responses | 3〜10 |
| tau (τ) | 0.10 | Probability threshold (sample below only) | 0.05〜0.20 |
| temperature | 0.9 | Response diversity control | 0.7〜1.0 |
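The role of tau can be sketched in a few lines of Python. This is an illustration under assumptions: `sample_from_tail` is a hypothetical helper, it picks uniformly among the candidates below the threshold (weighting by the verbalized probabilities would be an equally reasonable choice), and it falls back to the least likely candidate when nothing is below tau.

```python
import random


def sample_from_tail(candidates, tau=0.10, rng=None):
    """Keep candidates whose verbalized probability is below tau, then pick one.

    candidates: list of (text, probability) pairs.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    rng = rng or random.Random()
    tail = [c for c in candidates if c[1] < tau]
    if not tail:
        # Nothing under the threshold: fall back to the least typical option.
        return min(candidates, key=lambda c: c[1])
    return rng.choice(tail)


candidates = [("React Guide for Beginners", 0.45),
              ("Psychology laws in web performance", 0.08),
              ("Refactoring without tests", 0.05)]
print(sample_from_tail(candidates, tau=0.10, rng=random.Random(0)))
```

Raising tau widens the pool toward more conventional options; lowering it narrows the pool to the most unusual ones.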

4 Agent Modification Histories

1. prompt-engineer.md ⭐⭐⭐⭐⭐

Diversity Improvement: 2.0x | Lines Added: ~150

Before: Single Prompt Generation

## Optimization Process
1. Understand the use case
2. Generate an optimized prompt
3. Return the result

After: Verbalized Sampling Applied

## Verbalized Sampling for Prompt Diversity

### When to Use
- Exploring diverse prompt variations during optimization
- Discovering creative prompt patterns
- Generating prompts for A/B testing

### Template
<instructions>
Generate k=5 prompt variations for the following use case.
Wrap each variation in <response> tags with <text> and <probability>.
Sample from the tail distribution (probability < 0.10) to ensure diversity.
</instructions>

Use case: [description]

### Parameters
- k = 5 (5 prompt variations)
- tau = 0.10 (low probability region)
- temperature = 0.9 (high creativity)

Real-World Example

Request: “Optimize code review prompt”

Generated Prompt Variations:

<response>
<text>
You are a senior code reviewer. Analyze the following code for:
1. Security vulnerabilities
2. Performance bottlenecks
3. Maintainability issues

Provide specific recommendations with code examples.
</text>
<probability>0.08</probability>
</response>

<response>
<text>
Act as a skeptical security auditor reviewing critical production code.
Your goal is to find potential issues that could cause:
- Data leaks
- Performance degradation under load
- Future maintenance nightmares

Be thorough but constructive.
</text>
<probability>0.09</probability>
</response>

<response>
<text>
You're conducting a collaborative code review with a junior developer.
Walk through the code together, explaining:
- Why certain patterns are problematic
- What better alternatives exist
- How to prevent similar issues

Use a mentoring tone.
</text>
<probability>0.07</probability>
</response>

2. content-planner.md ⭐⭐⭐⭐⭐

Diversity Improvement: 1.8x | Lines Added: ~120

Added Section: Verbalized Sampling Utilization

## Diversity Enhancement Strategy

### Core Principle
- Mitigate LLM mode collapse
- Sample from distribution tail to discover non-typical topics
- 1.8x improvement in output diversity

### Prompt Template

<instructions>
Generate 8 blog topics for the following category.
Wrap each topic in <response> tags with:
- <title>: Topic title
- <keywords>: Main keywords (3-5)
- <seo_difficulty>: SEO difficulty (high/medium/low)
- <target_audience>: Target readers
- <probability>: Selection probability (set below 0.12)

Discover non-typical but valuable topics.
</instructions>

Category: [topic area]
Interest keywords: [keyword list]

### Parameter Settings
- k = 8 (more options needed)
- tau = 0.12 (slightly higher threshold for practicality)
- temperature = 0.9 (high creativity)

Real-World Example

Traditional Approach:

"5 web development topics"
→ Only typical topics emerge

Verbalized Sampling Applied:

<response>
<title>5 Psychology Laws Often Overlooked in Web Performance Optimization</title>
<keywords>web performance, UX psychology, cognitive load, perceived speed</keywords>
<seo_difficulty>medium</seo_difficulty>
<target_audience>intermediate frontend developers</target_audience>
<probability>0.08</probability>
</response>

<response>
<title>Legacy Code Refactoring: How to Proceed Safely Without Test Code</title>
<keywords>legacy, refactoring, testing, safety</keywords>
<seo_difficulty>low</seo_difficulty>
<target_audience>working developers</target_audience>
<probability>0.11</probability>
</response>

The result: original topics built around niche keywords.
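Responses with several tagged fields, like the ones above, can be read into structured records before they are shown to the user. The sketch below uses Python's standard `xml.etree.ElementTree`; the `TopicCandidate` class and `parse_topics` function are hypothetical names, and the code assumes the reply is well-formed XML (a title containing a bare `&` or `<` would need escaping first).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TopicCandidate:
    title: str
    keywords: list[str]
    seo_difficulty: str
    target_audience: str
    probability: float


def parse_topics(raw: str) -> list[TopicCandidate]:
    """Parse <response> blocks from a content-planner reply."""
    # Wrap the fragments in a single root so the parser accepts them.
    root = ET.fromstring(f"<root>{raw}</root>")
    topics = []
    for node in root.iter("response"):
        keywords = node.findtext("keywords", "").split(",")
        topics.append(TopicCandidate(
            title=node.findtext("title", "").strip(),
            keywords=[k.strip() for k in keywords if k.strip()],
            seo_difficulty=node.findtext("seo_difficulty", "").strip(),
            target_audience=node.findtext("target_audience", "").strip(),
            probability=float(node.findtext("probability", "0")),
        ))
    return topics
```

Structured records make it straightforward to sort by `probability` or filter by `seo_difficulty` before presenting options.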

3. writing-assistant.md ⭐⭐⭐⭐

Diversity Improvement: 1.6x | Lines Added: ~180

Added Section: Creative Writing Diversity

## Enhancing Writing Diversity with Verbalized Sampling

### Core Concept
Explore diverse approaches when writing blog posts to:
- Escape typical "intro → explanation → examples → conclusion" structure
- Discover high reader engagement styles
- Strengthen content originality

### When to Use?

<strong>Recommended</strong>:
- Writing technical tutorials
- Explaining complex concepts
- Content where reader engagement is crucial

<strong>Avoid</strong>:
- Official documentation
- API references
- Release notes

### Prompt Template

<instructions>
Suggest 5 different writing approaches for the following blog topic.

Wrap each approach in <response> tags with:
- <approach_name>: Approach name
- <style>: Writing style (e.g., tutorial, storytelling, comparative analysis)
- <structure>: Suggested structure (section outline)
- <tone>: Tone and manner (professional, friendly, technical, etc.)
- <target_reader>: Primary audience
- <probability>: Selection probability (below 0.10)

Include non-typical but effective approaches.
</instructions>

Topic: [blog topic]
Keywords: [core keywords]
Target: [audience]

Real-World Example: TypeScript Decorators Blog

Request: “Write blog about TypeScript 5.0 decorators”

Generated Approaches:

<response>
<approach_name>Failure Case-Driven Learning</approach_name>
<style>Problem-solving storytelling</style>
<structure>
1. Introduce production bug scenario
2. Attempt solution without decorators (failure)
3. Apply decorators (success)
4. Naturally explain core concepts
5. Other use cases
</structure>
<tone>Friendly and relatable</tone>
<target_reader>Beginner to intermediate developers</target_reader>
<probability>0.08</probability>
</response>

<response>
<approach_name>Interactive Code Laboratory</approach_name>
<style>Hands-on tutorial</style>
<structure>
1. Implement simple decorator directly (5 min)
2. Add features incrementally (10 min)
3. Apply real-world patterns (15 min)
4. Pitfalls and solutions
5. Challenge tasks
</structure>
<tone>Energetic and encouraging</tone>
<target_reader>Hands-on learners</target_reader>
<probability>0.09</probability>
</response>

4. image-generator.md ⭐⭐⭐

Diversity Improvement: 1.5x | Lines Added: ~90

Added Section: Visual Diversity

## Verbalized Sampling for Visual Diversity

Explore diverse visual styles when generating hero images.

### When to Use?
- Multiple visual approaches possible for blog topic
- Need for original, eye-catching images
- Want to differentiate brand image

### Prompt Generation Template

<instructions>
Generate 5 hero image prompts for the following blog topic.

Wrap each prompt in <response> tags with:
- <prompt>: English image generation prompt (for Gemini API)
- <style>: Visual style (minimalist, abstract, illustrative, photorealistic, etc.)
- <color_scheme>: Color combination
- <mood>: Atmosphere (professional, playful, futuristic, elegant, etc.)
- <probability>: Selection probability (below 0.12)

Include non-typical but attractive visual approaches.
Go beyond generic "tech stock photo" style.
</instructions>

Blog topic: [topic]
Core concepts: [concepts]
Target audience: [audience]

### Parameter Settings
- k = 5
- tau = 0.12
- temperature = 0.95 (maximum creativity)

Real-World Example: TypeScript Type System Image

Request: “Blog image about TypeScript type system”

Generated Image Prompts:

<response>
<prompt>Abstract geometric shapes interlocking precisely like puzzle pieces, representing type safety, modern gradient colors from deep blue to cyan, clean minimalist composition, 3D rendered style, no text overlays</prompt>
<style>abstract 3D geometric</style>
<color_scheme>blue-cyan gradient</color_scheme>
<mood>precise and modern</mood>
<probability>0.09</probability>
</response>

<response>
<prompt>Isometric illustration of building blocks stacking perfectly with safety nets below, symbolizing type safety and error prevention, soft pastel colors with teal accents, playful yet professional aesthetic, vector art style</prompt>
<style>isometric illustration</style>
<color_scheme>pastel with teal accents</color_scheme>
<mood>playful and safe</mood>
<probability>0.11</probability>
</response>

Discover original visual metaphors instead of generic code screenshots!

Real-World Application Patterns

Pattern 1: “Explore → Select → Execute” Workflow

graph TD
    A[User Request] --> B[Verbalized Sampling]
    B --> C[Generate 5〜8 Options]
    C --> D[User Selection]
    D --> E[Execute with Selected Option]
    E --> F[Final Deliverable]

    style C fill:#6bcf7f
    style D fill:#4ecdc4
    style F fill:#95e1d3

Example: Blog Topic Selection

1. Content Planner generates 8 topics with VS
2. User selects most interesting topic
3. Writing Assistant creates post with selected topic

Pattern 2: “Distribution Generation → Multiple Sampling” Strategy

Cost optimization pattern:

1. Generate the distribution once (a single API call whose output cost scales with k)
2. Sample from it at random as many times as needed (free)
3. Build a diverse content series from the samples

Example: Weekly Content Planning

Monday: Generate a distribution of 20 topics with VS
Tuesday-Friday: Sample different topics daily from distribution
→ Cost for 1 day, diversity for 4 days
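The weekly example can be sketched as a small in-memory pool that hands out topics without repeats. `TopicPool` is a hypothetical class for illustration; the placeholder topic strings stand in for a real generated batch, and persisting the pool between days (for example as a JSON file) is left out.

```python
import random


class TopicPool:
    """Holds one generated batch of topics and hands them out without repeats."""

    def __init__(self, topics, seed=None):
        self._remaining = list(topics)
        self._rng = random.Random(seed)

    def draw(self, n=1):
        """Pick up to n unused topics locally; no API call is involved."""
        picked = self._rng.sample(self._remaining, min(n, len(self._remaining)))
        for topic in picked:
            self._remaining.remove(topic)
        return picked

    def __len__(self):
        return len(self._remaining)


# Monday: one VS call produces 20 topics (placeholders here).
pool = TopicPool([f"Topic {i}" for i in range(1, 21)], seed=42)

# Tuesday to Friday: draw a different topic each day at no extra API cost.
week = {day: pool.draw(1) for day in ("Tue", "Wed", "Thu", "Fri")}
print(week, "remaining:", len(pool))
```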

Pattern 3: “Hierarchical Diversity” Approach

graph TD
    A[High Level: Topic Diversity] --> B[Content Planner]
    B --> C[Mid Level: Approach Diversity]
    C --> D[Writing Assistant]
    D --> E[Low Level: Expression Diversity]
    E --> F[Final Post]

    style A fill:#ff6b6b
    style C fill:#ffd93d
    style E fill:#6bcf7f

Apply Verbalized Sampling at each layer to achieve composite diversity.

Parameter Tuning Guide

Optimal Parameters by Task

| Task Type | k | tau | temperature | Reason |
|---|---|---|---|---|
| Prompt Engineering | 5 | 0.10 | 0.9 | Balance diversity and quality |
| Content Planning | 8 | 0.12 | 0.9 | More options, maintain practicality |
| Writing | 5 | 0.10 | 0.9 | Balance creativity and quality |
| Image Prompts | 5 | 0.12 | 0.95 | Maximum creativity, visual exploration |
| Web Research | 6 | 0.10 | 0.85 | Diverse perspectives, maintain reliability |
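One way to keep these per-task settings in a single place is a small lookup table. The values below are copied from the table above; the `VSParams` class, the task keys, and the `params_for` helper are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VSParams:
    k: int
    tau: float
    temperature: float


# Values taken from the "Optimal Parameters by Task" table.
TASK_PARAMS = {
    "prompt_engineering": VSParams(k=5, tau=0.10, temperature=0.9),
    "content_planning": VSParams(k=8, tau=0.12, temperature=0.9),
    "writing": VSParams(k=5, tau=0.10, temperature=0.9),
    "image_prompts": VSParams(k=5, tau=0.12, temperature=0.95),
    "web_research": VSParams(k=6, tau=0.10, temperature=0.85),
}

# Baseline used when a task has no dedicated entry.
DEFAULT_PARAMS = VSParams(k=5, tau=0.10, temperature=0.9)


def params_for(task: str) -> VSParams:
    """Return the tuned parameters for a task, or the baseline if unknown."""
    return TASK_PARAMS.get(task, DEFAULT_PARAMS)


print(params_for("content_planning"))
```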

k Value Selection Guide

k = 3    → Minimum diversity (quick decisions)
k = 5    → Recommended (balance diversity and efficiency) ⭐
k = 8    → High diversity (suitable for content planning)
k = 10+  → Excessive diversity (difficult choice, inefficient)

tau Value Tuning Strategy

tau = 0.05   → Extreme diversity (experimental)
tau = 0.10   → Recommended (discover non-typical options) ⭐
tau = 0.12   → Slightly conservative (maintain practicality)
tau = 0.20   → Insufficient diversity (includes general options)

temperature Settings

temperature = 0.7    → Low randomness (stable)
temperature = 0.9    → Recommended (balance creativity and quality) ⭐
temperature = 0.95   → High creativity (image prompts)
temperature = 1.0    → Maximum randomness (too unpredictable)

Cost-Benefit Analysis

API Cost Calculation

Base cost: $0.003 per 1K input tokens, $0.015 per 1K output tokens (Claude Sonnet)

Verbalized Sampling (k=5):
- Input tokens: ~2,000 tokens (prompt + context)
- Output tokens: ~1,500 tokens × 5 = 7,500 tokens
- Cost: $0.006 (input) + $0.112 (output) = $0.118

Traditional approach:
- Cost: $0.024
- Rework probability: 40%
- Expected total cost: $0.040 (average 1.67 runs)

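→ Per request, Verbalized Sampling costs roughly 3x more ($0.118 vs. $0.040); the long-term case rests on reduced rework and the optimization strategies below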
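The arithmetic behind these figures can be reproduced in a few lines of Python. Two assumptions are made here: the output price of $0.015 per 1K tokens is inferred from the article's own numbers ($0.112 for 7,500 output tokens), and the "1.67 runs" figure is read as the expected number of attempts when each run is redone with probability 0.4, that is 1 / (1 - 0.4).

```python
INPUT_PRICE_PER_1K = 0.003   # stated in the article
OUTPUT_PRICE_PER_1K = 0.015  # assumption: implied by $0.112 for 7,500 output tokens


def call_cost(input_tokens, output_tokens):
    """Cost of one API call in dollars."""
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)


def expected_cost_with_rework(cost_per_run, rework_probability):
    """Expected total cost when each run is repeated with the given probability."""
    expected_runs = 1 / (1 - rework_probability)  # mean of a geometric distribution
    return cost_per_run * expected_runs


vs_cost = call_cost(input_tokens=2000, output_tokens=1500 * 5)
traditional_expected = expected_cost_with_rework(0.024, 0.40)

print(f"VS (k=5): ${vs_cost:.4f}")                        # VS (k=5): $0.1185
print(f"Traditional, expected: ${traditional_expected:.3f}")  # Traditional, expected: $0.040
```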

Cost Optimization Strategies

1. Caching Utilization

# Distribution generation (1 API call)
<instructions>
Generate k=10 blog topic ideas...
</instructions>

# Multiple random sampling (free)
- Monday: Topics 3, 7 selected
- Wednesday: Topics 2, 9 selected
- Friday: Topics 1, 5 selected

2. Selective Application

High-value tasks (apply VS):
- Blog post creation (direct traffic impact)
- Prompt optimization (reusable)
- Content strategy planning (long-term impact)

Routine tasks (traditional approach):
- Simple Q&A
- General code reviews
- Routine task automation

3. Batch Processing

Weekly content planning:
- Monday: Generate 10 topics with VS
- Tuesday-Friday: Select different topics daily
→ Cost for 1 day, effect for 5 days

ROI Analysis

| Item | Traditional | Verbalized Sampling | Change |
|---|---|---|---|
| API Cost | $1.00 | $5.00 | +400% |
| Rework Cost | $0.40 | $0.10 | -75% |
| Quality Score | 7.5/10 | 9.0/10 | +20% |
| Originality Score | 6.0/10 | 9.5/10 | +58% |
| Total Cost | $1.40 | $5.10 | +264% |
| Value | 7.5 points | 9.5 points | +27% |
| Cost per Quality Point | $0.187 | $0.537 | +187% |

Conclusion: While costs increase, considering quality and originality improvements, it’s a worthwhile investment.
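For readers who want to check the derived rows of the ROI table, the following sketch recomputes them from the raw rows. It assumes "Total Cost" is API cost plus rework cost and "Cost per Quality Point" is total cost divided by the "Value" row; the dictionary keys and helper names are illustrative.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


traditional = {"api": 1.00, "rework": 0.40, "value": 7.5}
verbalized = {"api": 5.00, "rework": 0.10, "value": 9.5}


def total_cost(row):
    return row["api"] + row["rework"]


def cost_per_point(row):
    return total_cost(row) / row["value"]


print(f"Total cost change: {pct_change(total_cost(traditional), total_cost(verbalized)):+.0f}%")
print(f"Value change: {pct_change(traditional['value'], verbalized['value']):+.0f}%")
print(f"Cost per point: ${cost_per_point(traditional):.3f} -> ${cost_per_point(verbalized):.3f}")
```

Computed from unrounded inputs, the change in cost per quality point comes out slightly above the table's +187%, which appears to be derived from the rounded $0.187 and $0.537.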

Key Insights

1. Don’t Apply to All Agents

Suitable Agents:

  • ✅ prompt-engineer (creativity important)
  • ✅ content-planner (diversity needed)
  • ✅ writing-assistant (style diversity)
  • ✅ image-generator (visual exploration)

Unsuitable Agents:

  • ❌ seo-optimizer (accuracy important)
  • ❌ analytics (fact-based)
  • ❌ site-manager (standardization needed)
  • ❌ editor (consistency important)

2. Adjust Parameters for Tasks

One-size-fits-all settings aren’t effective:

  • Prompt engineering: k=5, tau=0.10 (balance)
  • Content planning: k=8, tau=0.12 (more options)
  • Image prompts: k=5, tau=0.12, temperature=0.95 (maximum creativity)

3. Quality Control is Essential

Verbalized Sampling increases diversity but does not guarantee quality, so a quality-control step is needed:

Post-Processing Filtering

8 generated options
→ Verify technical accuracy
→ Check brand tone and manner
→ Present final 5
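The filtering flow above can be sketched as a list of checks applied to every option, with the survivors capped at five. The checks below are placeholders: `on_brand` and `reasonable_length` are hypothetical rules, and a real technical-accuracy check would more likely be a human review or a separate reviewer prompt than a string test.

```python
from typing import Callable

Check = Callable[[str], bool]


def filter_options(options: list[str], checks: list[Check], limit: int = 5) -> list[str]:
    """Keep options that pass every check, in original order, up to `limit`."""
    passed = [opt for opt in options if all(check(opt) for check in checks)]
    return passed[:limit]


BANNED_TERMS = ("guaranteed", "hack")  # hypothetical brand rule


def on_brand(option: str) -> bool:
    """Reject options containing wording the brand avoids."""
    return not any(term in option.lower() for term in BANNED_TERMS)


def reasonable_length(option: str) -> bool:
    """Reject titles that are too short or too long to be usable."""
    return 20 <= len(option) <= 90


options = ["Growth hack your bundle size in one weekend", "Too short"]
options += [f"A practical look at web topic number {i}" for i in range(6)]
final = filter_options(options, [on_brand, reasonable_length])
print(final)
```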

Hybrid Approach

# Phase 1: Verbalized Sampling (diversity)
<instructions>
Generate 5 diverse blog topics...
</instructions>

# Phase 2: Chain-of-Thought (quality)
For each topic:
1. Evaluate SEO potential
2. Assess audience fit
3. Check resource requirements
4. Rank by priority
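Once Phase 2 has produced a score per criterion, the final ranking step is a weighted sort. The sketch below assumes scores on a 0-10 scale where higher is better; the criterion names, weights, and example scores are all made up for illustration and do not come from a real evaluation.

```python
def rank_topics(scores, weights):
    """Order topics by weighted score, best first.

    scores:  {topic: {criterion: value}}, higher is better on every criterion.
    weights: {criterion: weight}.
    """
    def weighted_total(topic):
        return sum(weights[c] * scores[topic].get(c, 0.0) for c in weights)

    return sorted(scores, key=weighted_total, reverse=True)


# Hypothetical weights and scores, for illustration only.
weights = {"seo_potential": 0.4, "audience_fit": 0.4, "low_effort": 0.2}
scores = {
    "Psychology laws in web performance": {"seo_potential": 7, "audience_fit": 8, "low_effort": 5},
    "Refactoring legacy code without tests": {"seo_potential": 8, "audience_fit": 9, "low_effort": 6},
}

for rank, topic in enumerate(rank_topics(scores, weights), start=1):
    print(rank, topic)
```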

Feedback Loop

graph LR
    A[Generate] --> B[User Selection]
    B --> C[Collect Usage Data]
    C --> D[Adjust Parameters]
    D --> A

    style C fill:#6bcf7f
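The "Adjust Parameters" step of this loop could look like the sketch below. The heuristic is an assumption, not something measured in this project: if users mostly pick options whose verbalized probability sits close to the threshold, tau is nudged up; if they mostly pick very unlikely options, it is nudged down. The bounds follow the recommended range of 0.05〜0.20.

```python
def adjust_tau(tau, selected_probs, step=0.01, lower=0.05, upper=0.20):
    """Nudge tau based on the probabilities of the options users selected.

    selected_probs: verbalized probabilities of the chosen options.
    Returns the new tau, clamped to [lower, upper].
    """
    if not selected_probs:
        return tau  # no usage data yet, keep the current setting
    mean_p = sum(selected_probs) / len(selected_probs)
    if mean_p > 0.75 * tau:
        tau += step   # users favor options near the threshold: allow more typical ones
    elif mean_p < 0.5 * tau:
        tau -= step   # users favor unusual options: tighten toward the tail
    return round(min(upper, max(lower, tau)), 2)


print(adjust_tau(0.10, [0.09, 0.095, 0.08]))  # 0.11
print(adjust_tau(0.10, [0.02, 0.03]))         # 0.09
```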

4. Cost Optimization is Possible

With k=5, API costs increase 5x, but:

  • Mitigated with caching strategies
  • Managed with selective application
  • Positive long-term ROI from reduced rework

5. Particularly Effective for Multilingual Content

Tailor the diversity to the cultural context of each language:

  • Korean: Korean reader context
  • Japanese: Japanese reader context
  • English: Global context

6. Shines in Agent Collaboration

graph TD
    A[Content Planner] -->|8 topics with VS| B[User Selection]
    B -->|Selected topic| C[Writing Assistant]
    C -->|5 approaches with VS| D[User Selection]
    D -->|Selected approach| E[Image Generator]
    E -->|5 styles with VS| F[Final Blog]

    style A fill:#ff6b6b
    style C fill:#ffd93d
    style E fill:#6bcf7f

Apply Verbalized Sampling at each stage to achieve hierarchical diversity.
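The staged flow can be sketched as a loop in which each stage generates options, a selection step picks one, and the choice becomes the context for the next stage. Everything here is a stand-in: `run_pipeline` is a hypothetical helper, `stub_generate` replaces a real agent call, and `select` replaces the user's choice.

```python
from typing import Callable

Generate = Callable[[str, int], list[str]]  # (context, k) -> options
Select = Callable[[list[str]], str]         # options -> chosen option


def run_pipeline(request: str,
                 stages: list[tuple[str, Generate, int]],
                 select: Select) -> list[tuple[str, str]]:
    """Run each stage, let `select` choose one option, and pass it onward."""
    context = request
    history = []
    for name, generate, k in stages:
        options = generate(context, k)
        context = select(options)
        history.append((name, context))
    return history


def stub_generate(context: str, k: int) -> list[str]:
    """Stand-in for an agent call that would return k VS options."""
    return [f"{context} > option {i}" for i in range(1, k + 1)]


stages = [("content-planner", stub_generate, 8),
          ("writing-assistant", stub_generate, 5),
          ("image-generator", stub_generate, 5)]

history = run_pipeline("web development", stages, select=lambda opts: opts[0])
for name, choice in history:
    print(f"{name}: {choice}")
```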

7. Avoid Failure Patterns

Excessive Diversity:

k=15, tau=0.03
→ Options become too experimental
→ Choosing becomes difficult
→ Wasted time

Inappropriate Application:

Apply VS to SEO optimization
→ Unvalidated strategies
→ Increased risk
→ No effect

8. Improve with Measurable Metrics

Self-BLEU (Diversity Measurement)

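import numpy as np
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def calculate_self_bleu(responses):
    """Mean BLEU of each response against all the others (lower = more diverse)."""
    # sentence_bleu expects token lists, so split each response on whitespace
    tokenized = [response.split() for response in responses]
    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    scores = []
    for i, hypothesis in enumerate(tokenized):
        references = tokenized[:i] + tokenized[i+1:]
        score = sentence_bleu(references, hypothesis, smoothing_function=smooth)
        scores.append(score)
    return float(np.mean(scores))

# Traditional: Self-BLEU = 0.75 (high = similar)
# VS applied: Self-BLEU = 0.38 (low = diverse)
# Diversity improvement: 2.0x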
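The 2.0x figure is consistent with taking the ratio of the two Self-BLEU scores. That reading is an assumption about how the number was derived, and `diversity_improvement` is a hypothetical helper used only to show the arithmetic.

```python
def diversity_improvement(baseline_self_bleu, vs_self_bleu):
    """Ratio of baseline to VS Self-BLEU; above 1.0 means VS output is less self-similar."""
    return baseline_self_bleu / vs_self_bleu


print(round(diversity_improvement(0.75, 0.38), 2))  # 1.97
```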

User Satisfaction

Survey questions:
1. Were generated options diverse? (1-5 points)
2. Did you discover original ideas? (1-5 points)
3. Are you satisfied with final deliverable? (1-5 points)

Average scores:
- Traditional: 3.2 points
- VS applied: 4.5 points
- Satisfaction improvement: 41%

9. Consider Long-Term Impact

Blog Content Quality:

  • Increased originality → Differentiation from competitors
  • Improved reader engagement → Increased dwell time
  • SEO benefits → Higher rankings for niche keywords

Agent System Evolution:

  • Diversity-centered design paradigm
  • VS-based prompt pattern library
  • More creative task automation

Conclusions and Recommendations

Key Lessons

  1. Verbalized Sampling is powerful for creative tasks

    • Achieved 2.0x prompt diversity, 1.8x content diversity
    • Can discover original, non-typical ideas
  2. Don’t apply everywhere

    • Use only for creativity-important tasks
    • Maintain traditional approach for accuracy/consistency tasks
  3. Parameter tuning is key to success

    • k=5, tau=0.10, temperature=0.9 as baseline
    • Adjust based on task characteristics
  4. Quality control is essential

    • Ensure quality with post-processing filtering
    • Balance diversity and quality with hybrid approach
  5. Costs are manageable

    • Optimize with caching, selective application, batch processing
    • Positive long-term ROI

  • ✅ prompt-engineer.md: Explore diverse patterns during optimization
  • ✅ content-planner.md: Discover original topics
  • ✅ writing-assistant.md: Diverse writing styles

Selective Application

  • ⚠️ image-generator.md: When visual branding is important
  • ⚠️ web-researcher.md: When research perspective diversification needed

Do Not Apply

  • ❌ seo-optimizer.md: Accuracy is top priority
  • ❌ analytics.md: Fact-based analysis required
  • ❌ site-manager.md: Standardized tasks

Getting Started

# 1. Check agent files
ls .claude/agents/

# 2. Test prompt-engineer.md first
cat .claude/agents/prompt-engineer.md

# 3. Apply in practice
"@prompt-engineer Optimize code review prompt (use Verbalized Sampling)"

# 4. Collect feedback and adjust
# - Measure diversity (Self-BLEU)
# - Evaluate quality (subjective)
# - Fine-tune parameters

Next Steps

  1. Week 1: Pilot test prompt-engineer.md
  2. Week 2: Add content-planner.md, plan content
  3. Week 3: Add writing-assistant.md, write actual blogs
  4. Week 4: Measure effects, optimize parameters, document process

Closing

Verbalized Sampling is a powerful technique for unlocking LLMs’ creative potential. But it’s not a magic solution. It demonstrates true value when used appropriately in the right situations.

Apply this technique to your Claude Code agent system to:

  • Generate more original content
  • Differentiate from competitor blogs
  • Improve reader engagement
  • Discover niche keywords

Take your blog’s growth to the next level.


References

Modified Agent Files:

  • .claude/agents/prompt-engineer.md
  • .claude/agents/content-planner.md
  • .claude/agents/writing-assistant.md
  • .claude/agents/image-generator.md

About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.