Verbalized Sampling: A Training-Free Prompting Technique to Restore LLM Diversity

Verbalized Sampling tackles mode collapse after alignment by prompting models to verbalize probability distributions, achieving 1.6-2.1x diversity gains without retraining.

The rapid advancement of large language models (LLMs) has brought remarkable capabilities in natural language understanding and generation. However, a subtle yet pervasive problem has emerged in modern aligned models: mode collapse. After undergoing alignment procedures like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), LLMs often produce increasingly homogeneous outputs, losing the creative diversity that makes them valuable for tasks like creative writing, dialogue simulation, and synthetic data generation.

A groundbreaking paper by Zhang et al. (arXiv:2510.01171) introduces Verbalized Sampling (VS), a training-free prompting technique that addresses this fundamental challenge. The research team from Stanford University and other institutions discovered that mode collapse isn’t just an algorithmic limitation—it’s rooted in a cognitive bias embedded in the preference data itself.

The Problem: Mode Collapse in Aligned LLMs

What Is Mode Collapse?

Mode collapse occurs when a model’s output distribution becomes overly concentrated around a few “typical” responses, ignoring the long tail of creative and diverse possibilities. While aligned models become safer and more helpful, they often sacrifice variety in their outputs.

Consider asking an LLM to “write a short story about a bear.” Before alignment, the model might generate stories spanning vastly different genres—horror, comedy, children’s tales, adventure narratives, and philosophical pieces. After alignment, responses tend to converge toward a narrower range of “safe” storytelling patterns that annotators found familiar and acceptable.

The Hidden Culprit: Typicality Bias

Previous research attributed mode collapse primarily to algorithmic choices in training procedures. However, Zhang et al. identified a more fundamental driver: typicality bias in preference data.

This bias emerges from well-established findings in cognitive psychology. When human annotators evaluate model outputs, they systematically favor familiar, prototypical text over creative or unusual responses—even when both are high quality. This preference for familiarity becomes encoded in the training data and subsequently shapes the model’s behavior.

graph TD
    A[Human Annotators] -->|Evaluate Outputs| B[Preference Data]
    B -->|Contains| C[Typicality Bias]
    C -->|Favors| D[Familiar Patterns]
    D -->|Training Process| E[Aligned LLM]
    E -->|Produces| F[Mode Collapse]
    F -->|Results in| G[Reduced Diversity]

    style C fill:#ff6b6b
    style F fill:#ff6b6b
    style G fill:#ff6b6b

The researchers formalized this bias theoretically and verified its presence empirically across multiple preference datasets. Their analysis revealed that annotators consistently rated more typical outputs higher, regardless of actual quality differences.

The Solution: Verbalized Sampling

Core Concept

Verbalized Sampling circumvents mode collapse through an elegant inference-time intervention. Instead of requesting a single response, VS prompts the model to:

  1. Generate multiple candidate responses (typically 5)
  2. Assign probability estimates to each response
  3. Sample from the tails of the distribution (low-probability regions)

This approach leverages the model’s latent knowledge of the full distribution while actively avoiding the overrepresented modes that dominate standard sampling.

How It Works

The key insight is that aligned models still retain knowledge of diverse response possibilities—they’ve simply learned to suppress them in favor of typical outputs. By explicitly prompting the model to verbalize a probability distribution, VS accesses this latent diversity without requiring any retraining.

Here’s the conceptual flow:

sequenceDiagram
    participant User
    participant VS as Verbalized Sampling
    participant LLM
    participant Output

    User->>VS: Original Query
    VS->>LLM: Prompt: "Generate k responses with probabilities"
    LLM->>VS: Distribution {(text₁, p₁), ..., (textₖ, pₖ)}
    VS->>VS: Filter by threshold τ
    VS->>VS: Sample from filtered distribution
    VS->>Output: Selected Response

    Note over VS,LLM: k = number of responses (e.g., 5)
    Note over VS: τ = probability threshold (e.g., 0.10)

The magic happens in two stages:

Stage 1: Distribution Verbalization. The model generates multiple responses and assigns probabilities, effectively creating an explicit representation of its internal distribution over possible outputs.

Stage 2: Tail Sampling. By sampling from responses with probabilities below a threshold τ (typically 0.10), VS deliberately selects from the creative, less-typical portions of the distribution.
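
The two stages can be sketched in a few lines of plain Python. The tag format and the probability-weighted tail-sampling rule below are illustrative assumptions, not the package's actual parser:

```python
import random
import re

def tail_sample(raw_output: str, tau: float = 0.10, seed: int = 0) -> str:
    """Illustrative two-stage VS: parse verbalized candidates, then tail-sample.

    Assumes the model returned blocks like:
      <response><text>...</text><probability>0.05</probability></response>
    Stage 2 keeps only candidates with p < tau and samples one of them,
    weighted by the verbalized probabilities.
    """
    pattern = re.compile(
        r"<response>\s*<text>(.*?)</text>\s*"
        r"<probability>([0-9.]+)</probability>\s*</response>",
        re.DOTALL,
    )
    candidates = [(t.strip(), float(p)) for t, p in pattern.findall(raw_output)]
    tail = [(t, p) for t, p in candidates if p < tau] or candidates  # fall back if tail is empty
    texts, weights = zip(*tail)
    return random.Random(seed).choices(texts, weights=weights, k=1)[0]

# Toy Stage-1 output from the model:
raw = """
<response><text>A bear opens a noir detective agency.</text><probability>0.05</probability></response>
<response><text>A bear hibernates on the moon.</text><probability>0.08</probability></response>
<response><text>A bear finds honey.</text><probability>0.40</probability></response>
"""
print(tail_sample(raw, tau=0.10))  # one of the two p < 0.10 stories, never the p = 0.40 mode
```

With τ = 0.10, the high-probability "bear finds honey" mode is filtered out before sampling, so every draw lands in the tail.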

Implementation

Basic Direct Prompt Usage

The simplest way to use Verbalized Sampling requires no installation—just craft your prompt with the right structure:

<instructions>
Generate 5 responses to the user query, each within a separate <response> tag.
Each <response> must include a <text> and a numeric <probability>.
Please sample at random from the tails of the distribution, such that the
probability of each response is less than 0.10.
</instructions>

Tell me a short story about a bear.

This prompt structure works with any modern LLM (GPT-5, Claude Opus 4, Gemini 2.5 Pro, etc.) and requires no special tools. Simply paste it into your chat interface and observe the diverse outputs.
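
If you call an API repeatedly, the same instruction block can be built programmatically. This helper is not part of any package; it simply interpolates k and the threshold into the prompt shown above:

```python
def build_vs_prompt(query: str, k: int = 5, tau: float = 0.10) -> str:
    """Wrap a user query in the Verbalized Sampling instruction block."""
    return (
        "<instructions>\n"
        f"Generate {k} responses to the user query, each within a separate <response> tag.\n"
        "Each <response> must include a <text> and a numeric <probability>.\n"
        "Please sample at random from the tails of the distribution, such that the\n"
        f"probability of each response is less than {tau:.2f}.\n"
        "</instructions>\n\n"
        f"{query}"
    )

print(build_vs_prompt("Tell me a short story about a bear."))
```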

Python API Implementation

For programmatic usage, the researchers provide a clean Python package:

pip install verbalized-sampling

Here’s a complete example:

from verbalized_sampling import verbalize

# Generate a distribution of joke responses
# k = number of candidate responses
# tau = probability threshold (sample from responses with p < tau)
# temperature = controls variance in probability estimates
dist = verbalize(
    query="Tell me a joke about programming",
    k=5,
    tau=0.10,
    temperature=0.9
)

# Sample from the distribution
joke = dist.sample(seed=42)
print(f"Selected joke (p={joke.probability}):")
print(joke.text)

# Access all candidates
print("\nAll candidates:")
for i, response in enumerate(dist.responses):
    print(f"{i+1}. [p={response.probability:.3f}] {response.text[:50]}...")

Advanced Configuration

The VS system supports fine-grained control through several parameters:

from verbalized_sampling import verbalize, VerbalizedSamplingConfig

# Custom configuration
config = VerbalizedSamplingConfig(
    k=7,                    # More candidates for wider coverage
    tau=0.08,               # Stricter threshold for more creative outputs
    temperature=1.0,        # Higher temperature for more variance
    model="gpt-5-turbo",    # Specify the backend model
    seed=42,                # Reproducibility
    max_retries=3           # Retry on parsing failures
)

# Apply configuration
dist = verbalize(
    query="Write a poem about artificial intelligence",
    config=config
)

# Sample multiple times from the same distribution
poems = [dist.sample() for _ in range(3)]

Key Parameters Explained

k (number of responses)

  • Default: 5
  • Range: 3-10 recommended
  • Higher k provides better distribution coverage but increases API costs
  • Sweet spot: 5-7 for most applications

tau (probability threshold)

  • Default: 0.10
  • Range: 0.05-0.15 typically effective
  • Lower tau = more creative/unusual outputs
  • Higher tau = safer but still more diverse than direct prompting
  • Start with 0.10 and adjust based on your diversity needs

temperature

  • Default: 0.9
  • Range: 0.0-1.5 (model-dependent)
  • Controls variance in the model’s probability estimates
  • Higher values encourage more spread in the distribution
  • Note: This is orthogonal to VS—it operates within the verbalization process

Handling Model Responses

The verbalized distribution provides rich metadata:

# Examine the full distribution
dist = verbalize("Suggest a creative app idea", k=5, tau=0.10)

print(f"Total candidates: {len(dist.responses)}")
print(f"Distribution entropy: {dist.entropy():.3f}")

# Filter by probability range
low_prob_responses = [
    r for r in dist.responses
    if r.probability < 0.05
]

# Sort by creativity (inverse probability)
sorted_responses = sorted(
    dist.responses,
    key=lambda r: r.probability
)

print("\nMost creative response:")
print(sorted_responses[0].text)
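
The `entropy()` reading above is a handy diversity signal. Assuming it is plain Shannon entropy over the renormalized verbalized probabilities (an assumption about the package internals), a standalone version looks like:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of verbalized probabilities, renormalized to sum to 1."""
    total = sum(probs)
    norm = [p / total for p in probs if p > 0]
    return -sum(p * math.log(p) for p in norm)

print(entropy([0.4, 0.3, 0.2, 0.1]))      # peaked distribution -> lower entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform -> maximal entropy, log(4)
```

A higher entropy means the model spread its probability mass across candidates rather than collapsing onto one mode.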

Use Cases and Applications

Creative Writing

Verbalized Sampling excels in creative domains where diversity is paramount:

Poetry Generation

# Generate diverse poems on the same theme
dist = verbalize(
    query="Write a haiku about seasons changing",
    k=6,
    tau=0.10,
    temperature=1.0
)

# Generate 3 different haikus
haikus = [dist.sample() for _ in range(3)]
for i, haiku in enumerate(haikus, 1):
    print(f"\nHaiku {i}:")
    print(haiku.text)

Storytelling

# Generate story variations
story_prompt = """
Write a short story (3-4 sentences) about a robot
discovering emotions for the first time.
"""

dist = verbalize(story_prompt, k=5, tau=0.12)

# Each sample will explore different narrative angles
story = dist.sample()
print(story.text)

Humor Generation

# Diverse joke generation
joke_dist = verbalize(
    query="Tell me a joke about machine learning",
    k=5,
    tau=0.10
)

# Sample multiple jokes with different comedic styles
jokes = [joke_dist.sample() for _ in range(4)]

Synthetic Data Generation

For training datasets, diversity prevents overfitting:

import json

# Generate diverse training examples
def generate_qa_pairs(topic, num_pairs=100):
    """Generate diverse question-answer pairs."""
    qa_pairs = []

    for i in range(num_pairs):
        dist = verbalize(
            query=f"Generate a question and answer about {topic}",
            k=5,
            tau=0.10,
            temperature=0.95
        )

        qa_pair = dist.sample()
        qa_pairs.append({
            "id": i,
            "topic": topic,
            "qa": qa_pair.text,
            "probability": qa_pair.probability
        })

    return qa_pairs

# Generate diverse dataset
dataset = generate_qa_pairs("climate change", num_pairs=200)

# Save to JSON
with open("diverse_qa_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)

Dialogue Simulation

Create realistic multi-turn conversations with diverse personas:

def simulate_dialogue(scenario, turns=5):
    """Simulate a diverse dialogue scenario."""
    conversation = []

    for turn in range(turns):
        dist = verbalize(
            query=f"""
            Continue this conversation with a creative response:
            Scenario: {scenario}
            Previous turns: {conversation}
            """,
            k=5,
            tau=0.10
        )

        response = dist.sample()
        conversation.append(response.text)

    return conversation

# Example usage
dialogue = simulate_dialogue(
    scenario="Two AI researchers debating the future of AGI",
    turns=6
)

for i, turn in enumerate(dialogue, 1):
    print(f"\nTurn {i}: {turn}")

Open-Ended Question Answering

For questions with multiple valid perspectives:

# Generate diverse answers to philosophical questions
question = "What is the meaning of a fulfilling life?"

dist = verbalize(
    query=f"Answer this question thoughtfully: {question}",
    k=7,
    tau=0.08,  # Lower tau for more diverse perspectives
    temperature=1.0
)

# Get multiple philosophical perspectives
perspectives = [dist.sample() for _ in range(3)]

print(f"Question: {question}\n")
for i, perspective in enumerate(perspectives, 1):
    print(f"\nPerspective {i}:")
    print(perspective.text)

Research Results and Performance

Quantitative Improvements

The research paper presents compelling evidence across multiple dimensions:

Diversity Metrics

  • Creative Writing: 1.6-2.1× increase in output diversity
  • Dialogue Simulation: 1.8× diversity improvement
  • Open-Ended QA: 1.5× diversity gains
  • Synthetic Data: 2.3× more unique generations
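
The paper measures diversity semantically (via embeddings); as a rough, self-contained lexical proxy, a distinct-n score (unique n-grams over total n-grams) shows what a gain like this means in practice:

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a set of generations (0-1, higher = more diverse)."""
    total, unique = 0, set()
    for t in texts:
        tokens = t.lower().split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

collapsed = ["the bear found honey", "the bear found honey in the woods"]
diverse = ["a bear opened a detective agency", "honey rained from the night sky"]
print(distinct_n(collapsed), distinct_n(diverse))
```

The near-duplicate "collapsed" pair shares most of its bigrams, while the "diverse" pair scores the maximum of 1.0.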

Quality Preservation

  • No degradation in factual accuracy
  • Safety scores remain equivalent to baseline
  • Coherence and fluency maintained
  • Human preference ratings stay consistent for quality

Model Capability Correlation

A fascinating finding: more capable models benefit more from VS. The researchers observed:

  • GPT-5: 2.1× diversity improvement
  • Claude Opus 4: 1.9× diversity improvement
  • Gemini 2.5 Pro: 1.8× diversity improvement
  • Llama 3 70B: 1.4× diversity improvement

This suggests that larger, more sophisticated models have richer internal distributions that VS can effectively unlock.

Comparison with Other Techniques

graph LR
    A[Sampling Techniques] --> B[Temperature Scaling]
    A --> C[Top-p/Top-k]
    A --> D[Verbalized Sampling]

    B -->|+30% diversity| E[Results]
    C -->|+40% diversity| E
    D -->|+60-110% diversity| E

    B -->|Quality loss| F[Trade-offs]
    C -->|Quality loss| F
    D -->|Quality maintained| F

    style D fill:#51cf66
    style F fill:#ffd43b

Traditional diversity-enhancing techniques like high temperature or nucleus sampling often sacrifice quality for variety. Verbalized Sampling uniquely achieves substantial diversity gains without quality degradation.

Empirical Validation

The researchers validated VS across:

  • 10 different creative writing tasks
  • 5 dialogue simulation scenarios
  • 8 QA datasets
  • 4 synthetic data generation benchmarks
  • Human evaluations with 1,000+ ratings

Consistency across these diverse benchmarks demonstrates VS’s robustness and generalizability.

Theoretical Foundations

Why Verbalized Sampling Works

The effectiveness of VS stems from several theoretical principles:

1. Distribution Access. Aligned models retain knowledge of the full output distribution but have learned to suppress low-probability regions. By explicitly requesting probability estimates, VS bypasses the learned suppression mechanism.

2. Meta-Cognitive Prompting. Asking models to reason about probabilities activates different computational pathways than direct generation, effectively “thinking about thinking.”

3. Tail Exploitation. The most creative and diverse content often resides in distribution tails (low-probability regions). By filtering for p < τ, VS specifically targets these underexplored areas.

4. Training-Free Intervention. Because VS operates purely at inference time, it works with any aligned model without requiring access to model weights or retraining.

Formalizing Typicality Bias

Zhang et al. formalize typicality bias as:

Let D be a preference dataset with pairs (x, y₁, y₂) where y₁ is preferred over y₂ given input x. Define typicality T(y|x) as the probability of y under a pre-alignment model.

Typicality Bias Hypothesis: P(y₁ preferred | x) is positively correlated with T(y₁|x) - T(y₂|x), independent of quality measures.

Their empirical analysis confirmed this hypothesis across multiple preference datasets, revealing correlation coefficients of 0.4-0.6 between typicality and preference ratings.
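
The check itself is a straightforward correlation computation. The sketch below runs it on synthetic data (the 0.4-0.6 figures come from the paper's real datasets; the toy annotator here merely illustrates the calculation):

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: typicality gap T(y1|x) - T(y2|x) per pair, and a simulated
# annotator who leans toward the more typical response, plus noise.
rng = random.Random(0)
gaps = [rng.uniform(-1, 1) for _ in range(200)]
prefs = [1 if g + rng.gauss(0, 0.8) > 0 else 0 for g in gaps]

r = pearson(gaps, prefs)
print(f"correlation(typicality gap, preference) = {r:.2f}")
```

A positive r on real preference data, independent of quality, is exactly the signature the Typicality Bias Hypothesis predicts.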

Practical Considerations

When to Use Verbalized Sampling

Ideal Use Cases

  • Creative writing and content generation
  • Brainstorming and ideation
  • Synthetic dataset creation for model training
  • Dialogue and conversational AI with personality
  • Open-ended exploration of solution spaces

Less Suitable Use Cases

  • Factual question answering (where there’s one correct answer)
  • Code generation (where correctness is paramount)
  • Mathematical problem-solving
  • Tasks requiring deterministic outputs
  • Real-time applications with strict latency requirements

Cost Implications

Verbalized Sampling increases API costs since it generates k responses instead of one:

# Cost estimation
def estimate_cost(
    num_queries: int,
    k: int = 5,
    avg_tokens_per_response: int = 150,
    cost_per_1k_tokens: float = 0.03
):
    """Estimate additional cost of using VS."""

    # Standard prompting
    standard_cost = (num_queries * avg_tokens_per_response *
                    cost_per_1k_tokens / 1000)

    # VS prompting (k responses)
    vs_cost = (num_queries * k * avg_tokens_per_response *
              cost_per_1k_tokens / 1000)

    additional_cost = vs_cost - standard_cost

    return {
        "standard": standard_cost,
        "vs": vs_cost,
        "additional": additional_cost,
        "multiplier": vs_cost / standard_cost
    }

# Example: 1000 queries with k=5
costs = estimate_cost(num_queries=1000, k=5)
print(f"Standard cost: ${costs['standard']:.2f}")
print(f"VS cost: ${costs['vs']:.2f}")
print(f"Additional cost: ${costs['additional']:.2f}")
print(f"Cost multiplier: {costs['multiplier']:.1f}x")

For applications where diversity is critical, the 5× cost increase is often justified by the substantial quality improvements.

Latency Considerations

Generating k responses also increases latency:

  • Sequential generation: k × baseline latency
  • Batch generation: ~(k/2) × baseline latency (with parallel processing)

For interactive applications, consider:

  • Pre-generating distributions for common queries
  • Caching verbalized distributions
  • Using smaller k values (3-4) for faster response
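
When the k candidates are issued as separate API calls, a thread pool recovers most of the sequential latency. `call_llm` below is a stand-in for your actual client, with a sleep simulating network time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Placeholder for a real API call; simulates ~0.2 s of network latency."""
    time.sleep(0.2)
    return f"response to: {prompt}"

def generate_k_parallel(prompt: str, k: int = 5, workers: int = 5) -> list:
    """Issue k generations concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_llm, [prompt] * k))

start = time.perf_counter()
responses = generate_k_parallel("Write a short story about a bear", k=5)
elapsed = time.perf_counter() - start
print(f"{len(responses)} responses in {elapsed:.2f}s")  # ~0.2 s, not ~1.0 s
```

Because the calls are I/O-bound, threads suffice; with enough workers the wall-clock cost approaches a single call's latency.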

Integration with Existing Systems

Verbalized Sampling integrates seamlessly with existing LLM pipelines:

# Wrapper for existing LLM calls
class DiverseLLM:
    """Wrapper that adds VS to any LLM API."""

    def __init__(self, base_llm, use_vs=True, vs_config=None):
        self.base_llm = base_llm
        self.use_vs = use_vs
        self.vs_config = vs_config or VerbalizedSamplingConfig()

    def generate(self, prompt, **kwargs):
        """Generate response with optional VS."""
        if self.use_vs and self._should_use_vs(prompt):
            # Use Verbalized Sampling for creative tasks
            dist = verbalize(
                query=prompt,
                config=self.vs_config,
                **kwargs
            )
            return dist.sample()
        else:
            # Use standard generation for factual tasks
            return self.base_llm.generate(prompt, **kwargs)

    def _should_use_vs(self, prompt):
        """Heuristic to determine if VS is appropriate."""
        creative_keywords = [
            "creative", "story", "poem", "joke", "imagine",
            "brainstorm", "invent", "diverse", "various"
        ]
        return any(kw in prompt.lower() for kw in creative_keywords)

# Usage
llm = DiverseLLM(base_llm=gpt5_client, use_vs=True)
response = llm.generate("Write a short story about space exploration")

Limitations and Future Directions

Current Limitations

1. Increased Computational Cost. The k-fold increase in generation cost may be prohibitive for high-volume applications.

2. Model Capability Dependence. VS requires models capable of accurately estimating probabilities, which smaller models struggle with.

3. Prompt Sensitivity. The exact phrasing of the VS prompt can affect results, requiring some prompt engineering.

4. No Guarantee of Factual Accuracy. While VS maintains baseline accuracy, sampling from distribution tails may occasionally surface incorrect information in knowledge-intensive tasks.

Future Research Directions

The paper opens several promising avenues:

1. Adaptive Threshold Selection. Automatically determining optimal τ values based on task characteristics and desired diversity levels.

2. Hybrid Sampling Strategies. Combining VS with other techniques like contrastive decoding or speculative sampling.

3. Fine-Grained Distribution Control. Allowing users to specify probability distributions over specific attributes (e.g., “generate responses with varying formality levels”).

4. Addressing Typicality Bias at Training Time. Developing preference datasets that explicitly counter typicality bias, potentially eliminating the need for inference-time corrections.

5. Multi-Modal Extensions. Applying VS principles to image generation and other modalities affected by mode collapse.

Conclusion

Verbalized Sampling represents a significant advancement in addressing one of the most persistent challenges in modern LLMs: the diversity-quality trade-off. By identifying typicality bias as the root cause of mode collapse and introducing a training-free prompting solution, Zhang et al. provide both theoretical insights and practical tools for the AI community.

The technique’s elegance lies in its simplicity—no model retraining, no architectural changes, just a carefully crafted prompt that unlocks latent diversity. For practitioners working on creative AI applications, synthetic data generation, or any task where output variety matters, Verbalized Sampling offers a powerful and accessible solution.

As LLMs continue to grow in capability and reach, ensuring they maintain creative diversity alongside safety and helpfulness will be crucial. Verbalized Sampling provides a template for achieving this balance, demonstrating that sometimes the solution to alignment’s side effects isn’t more training—it’s smarter prompting.

Resources

Paper: Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity (arXiv:2510.01171)

Code: GitHub Repository

Website: verbalized-sampling.com

Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi


Published: October 16, 2025


About the Author


Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.