Analyzing Blog Revisit Intent with SSR Methodology

Analyzing Blog Revisit Intent with SSR Methodology

Experimental results and statistical analysis of 225 evaluations using LLM-based Semantic Similarity Rating. Validated high reliability with ICC 0.83 and visualizations.

Overview

Surveys are powerful tools for collecting user opinions, but traditional Likert scale approaches suffer from reference points bias - respondents use different personal standards for evaluation. What one person considers “neutral” might be “good” to another.

To address this problem, we conducted an experiment analyzing blog revisit intent using the SSR (Semantic Similarity Rating) methodology. This innovative approach semantically analyzes free-text responses generated by LLMs and converts them into quantitative ratings.

Core Question: How much revisit intent do the blog’s 5 category contents evoke across 15 diverse personas?

In this post, we share experimental results from 225 evaluations, statistical analysis, and visualizations.

What is SSR Methodology?

SSR is a methodology proposed in the paper “LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings” published on arXiv in October 2024.

Problems with Traditional LLM Surveys

Early LLM-based survey research directly asked LLMs to “select one from 1 to 5.” However, this approach had serious problems:

Problem 1: Unrealistic Distribution

  • Actual humans: Near-normal distribution (concentrated on 2, 3, 4)
  • LLM direct evaluation: Extreme distribution (excessive concentration on 1 or 5)

Problem 2: Lack of Consistency

  • Responses vary significantly even when repeating the same question
  • Test-Retest reliability < 0.60

Problem 3: Loss of Context

  • Cannot understand “why” they evaluated that way
  • Absence of qualitative insights

SSR’s Innovative Solution

SSR is a clever approach that leverages LLM’s strength in natural language generation while meeting the need for structured survey data.

graph TB
    subgraph Traditional["Traditional LLM Survey"]
        Q1[Question: Revisit intent?] --> A1["LLM direct response: 4<br/>(Unrealistic distribution)"]
        A1 --> R1[Rating 4.0]
    end

    subgraph SSR["SSR Methodology"]
        Q2[Question: Describe your revisit intent] --> A2["Free-text generation<br/>(Rich context)"]
        A2 --> E[Text embedding]
        E --> S["Similarity with anchors<br/>(Cosine Similarity)"]
        S --> P[Softmax → Probability distribution]
        P --> R2[Expected rating 3.078]
    end

5-Step Algorithm

Step 1: Free-Text Response Generation

prompt = """You are {persona_name}, a {occupation}.
Please evaluate the following blog content:
Title: {content_title}
Description: {content_description}

Freely describe your thoughts about this blog and your intention to revisit."""

response = await openai_client.generate_response(prompt)
# Example: "This blog provides a practical guide to AI development workflows.
#           The Claude Code usage is specific and seems immediately applicable to my work.
#           I would like to visit again and read other articles."

Step 2: Response Embedding

response_embedding = await openai_client.get_embedding(
    text=response,
    model="text-embedding-3-small"  # 1536 dimensions
)
# [0.023, -0.145, 0.089, ...] (1536-dimensional vector)

Step 3: Calculate Cosine Similarity with Anchors

ANCHORS = {
    1: "This blog does not meet my expectations at all, and I will not revisit.",
    2: "This blog has some value, but I probably won't revisit.",
    3: "This blog is okay, but I'm not sure if I'll revisit.",
    4: "This blog provides useful information, so I'm likely to revisit.",
    5: "This blog is excellent, and I will revisit regularly."
}

# Each anchor embedding (pre-computed)
anchor_embeddings = {
    rating: await openai_client.get_embedding(text)
    for rating, text in ANCHORS.items()
}

# Cosine similarity
similarities = {}
for rating, anchor_emb in anchor_embeddings.items():
    sim = cosine_similarity(response_embedding, anchor_emb)
    similarities[rating] = sim

# Example: {1: 0.12, 2: 0.25, 3: 0.45, 4: 0.78, 5: 0.62}

Step 4: Generate Probability Distribution with Softmax

def softmax(similarities, temperature=1.0):
    """Convert cosine similarities to probability distribution"""
    values = np.array(list(similarities.values()))
    exp_values = np.exp(values / temperature)
    return exp_values / exp_values.sum()

probabilities = softmax(similarities)
# [0.05, 0.10, 0.20, 0.45, 0.20]  # Probability for each rating

Step 5: Calculate Expected Value

ratings = [1, 2, 3, 4, 5]
expected_rating = sum(r * p for r, p in zip(ratings, probabilities))
# 3.65 = 1×0.05 + 2×0.10 + 3×0.20 + 4×0.45 + 5×0.20

Advantages of SSR

1. Semantic Consistency

  • Eliminates differences in respondents’ subjective scales
  • All evaluations performed in the same embedding space

2. Rich Context Preservation

  • Provides quantitative rating + qualitative explanation simultaneously
  • Understand “why” they evaluated that way

3. High Reliability

  • Test-Retest reliability: Achieves 90% of human level
  • KS similarity > 0.85

4. Cost Efficiency

  • About $0.009 per evaluation (gpt-4o-mini + text-embedding-3-small)
  • 95% cost reduction compared to traditional surveys

Experimental Design

Persona Composition (15 people)

We generated personas from diverse countries and occupations:

NameCountryOccupation
Alex JohnsonUSASenior Software Engineer
김서연South KoreaAI Researcher
田中健太 (Tanaka Kenta)JapanData Engineer
Hans MüllerGermanyML Engineer
Priya SharmaIndiaData Analyst
Carlos SantosBrazilBackend Developer
Emily RobertsUKProduct Manager
Sophie TremblayCanadaDevOps Engineer
Wei ZhangSingaporeAI Product Developer
Pierre DuboisFranceData Scientist
Olivia ChenAustraliaUX Researcher
Dr. Michael LeeUSAAI Researcher
Lars van der BergNetherlandsSoftware Architect
박지훈South KoreaFull-Stack Developer
Li WeiChinaAI Student

Each persona includes the following information:

  • Demographics: Age, country, occupation
  • Interests: AI/ML, data engineering, web development, etc.
  • Search queries: Actual search terms based on Google Search Console
  • Tech stack: Python, JavaScript, React, Docker, etc.

Evaluated Contents (5 items)

  1. Claude Code Best Practices - AI-Powered Development Workflow
  2. Data Mesh vs Data Warehouse - Architectural Decision Framework
  3. Google Analytics MCP - Automating Google Analytics with Model Context Protocol
  4. Screenshot to Code - AI-Powered Screenshot to Code Tools and Techniques
  5. GraphRAG and Multi-Agent Systems - Advanced AI Architecture

Experimental Setup

  • Total evaluations: 225 (15 personas × 5 contents × 3 repetitions)
  • Reason for repetition: Test-Retest reliability verification
  • LLM model: gpt-4o-mini (cost-efficient)
  • Embedding model: text-embedding-3-small (1536 dimensions)
  • Temperature: 0.7 (balanced consistency and diversity)
  • Execution time: Approximately 8 minutes 24 seconds
  • Total cost: Approximately $2.00

Implementation Code

The entire system was implemented in Python, with the following main modules:

OpenAI Client

from openai import AsyncOpenAI

class OpenAIClient:
    def __init__(self, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)
        self.llm_model = "gpt-4o-mini"
        self.embedding_model = "text-embedding-3-small"

    async def generate_response(
        self,
        prompt: str,
        temperature: float = 0.7
    ) -> str:
        """Generate LLM free-text response"""
        response = await self.client.chat.completions.create(
            model=self.llm_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature
        )
        return response.choices[0].message.content

    async def get_embedding(self, text: str) -> List[float]:
        """Generate text embedding"""
        response = await self.client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding

SSR Rater

import numpy as np
from typing import Dict, List

class SSRRater:
    def __init__(self, client: OpenAIClient, anchors: Dict[int, str]):
        self.client = client
        self.anchors = anchors
        self.anchor_embeddings = {}

    async def initialize_anchors(self):
        """Pre-compute anchor embeddings (executed only once)"""
        for rating, text in self.anchors.items():
            self.anchor_embeddings[rating] = await self.client.get_embedding(text)

    def cosine_similarity(
        self,
        vec1: List[float],
        vec2: List[float]
    ) -> float:
        """Calculate cosine similarity"""
        vec1 = np.array(vec1)
        vec2 = np.array(vec2)
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def softmax(
        self,
        similarities: Dict[int, float],
        temperature: float = 1.0
    ) -> List[float]:
        """Generate softmax probability distribution"""
        values = np.array(list(similarities.values()))
        exp_values = np.exp(values / temperature)
        return exp_values / exp_values.sum()

    async def evaluate(
        self,
        persona: Dict,
        content: Dict,
        prompt_template: str
    ) -> Dict:
        """Execute SSR evaluation"""
        # 1. Generate free-text response
        prompt = prompt_template.format(**persona, **content)
        response = await self.client.generate_response(prompt)

        # 2. Embed response
        response_embedding = await self.client.get_embedding(response)

        # 3. Calculate similarity with anchors
        similarities = {}
        for rating, anchor_emb in self.anchor_embeddings.items():
            sim = self.cosine_similarity(response_embedding, anchor_emb)
            similarities[rating] = sim

        # 4. Generate probability distribution
        probabilities = self.softmax(similarities)

        # 5. Calculate expected value
        ratings = list(similarities.keys())
        expected_rating = sum(r * p for r, p in zip(ratings, probabilities))

        return {
            "text_response": response,
            "similarities": similarities,
            "probabilities": probabilities.tolist(),
            "expected_rating": expected_rating,
            "most_likely_rating": ratings[np.argmax(probabilities)]
        }

Survey Runner

import asyncio
from tqdm.asyncio import tqdm

class SurveyRunner:
    def __init__(self, rater: SSRRater):
        self.rater = rater

    async def run_survey(
        self,
        personas: List[Dict],
        contents: List[Dict],
        prompt_template: str,
        repetitions: int = 3,
        max_concurrent: int = 10
    ) -> List[Dict]:
        """Execute full survey"""
        # Create evaluation tasks
        evaluations = []
        for persona in personas:
            for content in contents:
                for rep in range(repetitions):
                    evaluations.append({
                        "persona": persona,
                        "content": content,
                        "repetition": rep + 1
                    })

        # Parallel execution (with progress bar)
        semaphore = asyncio.Semaphore(max_concurrent)

        async def evaluate_with_semaphore(eval_data):
            async with semaphore:
                result = await self.rater.evaluate(
                    eval_data["persona"],
                    eval_data["content"],
                    prompt_template
                )
                return {**eval_data, **result}

        results = []
        for coro in tqdm.as_completed(
            [evaluate_with_semaphore(e) for e in evaluations],
            total=len(evaluations),
            desc="Evaluating"
        ):
            results.append(await coro)

        return results

Experimental Results

Overall Statistics

MetricValue
Average Expected Rating3.078 / 5.0
Standard Deviation0.016
Minimum3.010
Maximum3.106
Median3.080

Rating Distribution

RatingFrequencyPercentage
1 point00.0%
2 points00.0%
3 points00.0%
4 points21997.3%
5 points62.7%

Interpretation:

  • Almost all evaluations (97.3%) resulted in “High revisit intent” (4 points)
  • Only a small minority (2.7%) showed “Very high revisit intent” (5 points)
  • Not a single evaluation below 3 points → All content is positive for inducing revisits
  • Average 3.078 is the expected value; the most likely rating is actually 4 points

Content Ranking

RankContentAverage RatingStd Dev
1Claude Code Best Practices3.0860.009
2GraphRAG and Multi-Agent Systems3.0820.016
3Screenshot to Code3.0820.017
4Data Mesh vs Data Warehouse3.0700.015
5Google Analytics MCP3.0700.013

Insights:

  • 1st place Claude Code: Highest rating + lowest standard deviation (0.009)
    • Very high interest in AI development workflows, consistently positive across all personas
  • 2-3rd place GraphRAG, Screenshot to Code: High interest in advanced AI technology and practical tools
  • 4-5th place Data Mesh, GA MCP: Data architecture and analytics tool automation
  • Difference between 1st and 5th is only 0.016 → All content maintains uniformly high quality

Persona Ranking

Top 5:

NameCountryOccupationAverage Rating
박지훈South KoreaFull-Stack Developer3.089
Alex JohnsonUSASenior Software Engineer3.088
Emily RobertsUKProduct Manager3.087
Pierre DuboisFranceData Scientist3.086
Wei ZhangSingaporeAI Product Developer3.083

Bottom 5:

NameCountryOccupationAverage Rating
Dr. Michael LeeUSAAI Researcher3.059
田中健太JapanData Engineer3.065
Hans MüllerGermanyML Engineer3.068
Carlos SantosBrazilBackend Developer3.069
Li WeiChinaAI Student3.070

Insights:

  • Developers from South Korea, USA, and Europe show high revisit intent
  • AI researcher (Dr. Michael Lee) is relatively lower but still positive at 3.059
  • Low standard deviation (0.009-0.025) → Consistent responses across repeated measurements

Visualizations

1. Rating Distribution Analysis

Distribution Analysis

Top Left: Most Likely Ratings - 97.3% concentrated on 4 points Top Right: Expected Ratings - Average 3.078, standard deviation 0.016 Bottom Left: Probability distribution by evaluation (first 20) Bottom Right: Results with Softmax Temperature 1.0 applied

2. Persona × Content Heatmap

Heatmap

  • Bright colors: High revisit intent
  • Dark colors: Relatively lower revisit intent
  • All cells are relatively bright → Overall high ratings

Findings:

  • 박지훈 (Full-Stack Developer): High ratings for all content
  • Claude Code Best Practices: High ratings from almost all personas
  • Japanese Data Engineer (田中): Relatively lower pattern (cultural/linguistic differences?)

3. Persona-wise Box Plot

Box Plot

Top: Rating distribution by persona - Most concentrated in 3.05-3.10 range Bottom: Rating distribution by content - Claude Code has highest median

4. Correlation Matrix

Correlation Matrix

Pearson correlation coefficients between 3 repetitions:

  • Rep1 vs Rep2: 0.73
  • Rep1 vs Rep3: 0.53
  • Rep2 vs Rep3: 0.62

Statistical Reliability Analysis

Test-Retest Reliability

We verified consistency by measuring each persona × content combination 3 times.

ICC (Intraclass Correlation Coefficient)

from scipy import stats

# Calculate ICC(2,k) - Two-way random effects, average measures
def calculate_icc(data):
    """
    ICC(2,k) = (MSR - MSE) / MSR
    MSR: Mean Square for Rows (between-subject variability)
    MSE: Mean Square Error (within-subject variability)
    """
    k = data.shape[1]  # number of raters (repetitions)
    n = data.shape[0]  # number of subjects

    # Sum of Squares
    subject_means = data.mean(axis=1)
    grand_mean = data.values.mean()

    SS_between = k * np.sum((subject_means - grand_mean) ** 2)
    SS_within = np.sum((data.values - subject_means.values[:, np.newaxis]) ** 2)

    # Mean Squares
    MS_between = SS_between / (n - 1)
    MS_within = SS_within / (n * (k - 1))

    # ICC
    icc = (MS_between - MS_within) / MS_between
    return icc

icc_score = calculate_icc(pivot_data)  # 0.8330

Result: ICC = 0.8330

Interpretation:

  • 0.75 or above: Good reliability
  • 0.85 or above: Excellent reliability
  • 0.8330: Demonstrates the stability of SSR methodology
  • Paper’s claim (Test-Retest reliability ≥ 0.85 at 90% level) verified

Pearson Correlation Coefficients

ComparisonCorrelation (r)Interpretation
Repetition 1 vs 20.7301High correlation
Repetition 1 vs 30.5298Moderate correlation
Repetition 2 vs 30.6246Moderate-high correlation

Overall Assessment:

  • Very high reliability: Standard deviation < 0.01 (8 people)
  • High reliability: Standard deviation 0.01-0.02 (6 people)
  • ⚠️ Moderate reliability: Standard deviation 0.02-0.03 (1 person)

Meaning of Reliability

SSR Methodology Verification:

  • Standard deviation < 0.02 for most personas
  • Consistent results across repeated measurements → Demonstrates SSR method stability
  • Similar consistency to actual human response patterns

Cost Analysis

Actual Cost

ItemQuantityUnit PriceCost
Anchor Embedding5 times$0.00001/token × ~20 tokens$0.0010
LLM Response Generation225 times$0.15/1M tokens × ~100 tokens$3.38
Response Embedding225 times$0.00001/token × ~50 tokens$0.11
Total Cost--~$3.50

Actual Measurement:

  • Expected cost: $2-3
  • Actual cost: About $3.50 (more tokens than expected)
  • Cost per evaluation: $0.016

Cost Efficiency

Comparison with Traditional Surveys:

MethodCost per RespondentCost for 225 ResponsesTime Required
Traditional Survey$1-5$225-1,1251-2 weeks
SSR$0.016$3.508 minutes

Savings:

  • Cost: 95-99% reduction
  • Time: 99% reduction
  • Scale: No constraints (thousands to tens of thousands of evaluations possible)

Additional Benefits

Qualitative Benefits:

  1. Rich Context: Detailed text responses for each evaluation
  2. Immediate Execution: Instant results with just API calls
  3. Easy Repetition: Simple re-evaluation when content changes
  4. A/B Testing: Simultaneous testing of multiple versions

Key Findings

1. Overall High Revisit Intent

  • Average 3.078/5.0 → Most at “High revisit intent” (4 points) level
  • 97.3% at 4 points, only 2.7% at 5 → Content quality is excellent but not “perfect”
  • Room for improvement: Need to strengthen content for 4→5 point conversion

2. Small Differences Between Contents

  • Difference between 1st (Claude Code) and 5th (GA MCP) is only 0.016
  • All content maintains uniformly high quality
  • Not biased toward specific categories

3. Developer-Centric Content Ranks Higher

  • Claude Code, GraphRAG, Screenshot to Code are top 3
  • Strategy: Strengthen AI development tools and workflow content
  • High demand for practical guides

4. Minimal Regional/Occupational Differences

  • South Korea (박지훈 3.089) vs Japan (田中 3.065) difference 0.024
  • US Senior Dev (Alex 3.088) vs AI Researcher (Michael 3.059) difference 0.029
  • Universal Interest: AI development trends transcend country/occupation

5. High Methodology Reliability

  • Average standard deviation 0.014 → Excellent consistency in repeated measurements
  • ICC 0.833 → Demonstrates SSR methodology stability
  • Similar to actual human response patterns

Blog Operation Utilization

1. Content Strategy

Priority Content:

  • Expand Claude Code Series: Highest interest, so create sequels
    • Part 2: Advanced patterns
    • Part 3: Production use cases
  • Focus on AI Development Workflow: Strengthen developer-centric content
  • Deepen GraphRAG/Multi-Agent: Confirmed demand for advanced topics

4→5 Point Conversion Strategy:

  • Add hands-on examples (Hands-on Tutorials)
  • Include case studies (Real-world Examples)
  • Provide code repositories (GitHub Repos)
  • Supplement with video tutorials

2. Target Audience Analysis

Core Readership:

  • Developers from USA, South Korea, Europe
  • AI/ML Engineers, Full-Stack Developers
  • Tech Workers aged 25-40

Expandable Readership:

  • Developers from Japan, Brazil (3.06-3.07 level)
  • Data Analysts, Product Managers (data-driven decision making)

Multilingual Content Priority:

  1. English (Essential - global audience)
  2. Korean (Core - domestic audience)
  3. Japanese (Expansion - potential audience)

3. Additional Research Topics

Quantitative Analysis:

  • 4→5 Point Conversion Factors analysis: What elements induce “very high” revisit intent?
  • Persona-specific Preferred Content: Build occupation-specific customized recommendation system
  • Time Series Analysis: Changes in revisit intent over time after content publication

Qualitative Analysis:

  • Text Response Analysis: Extract key keywords from free responses
  • Sentiment Analysis: Ratio of positive/negative sentiment
  • Topic Modeling: Discover hidden topics with LDA/BERTopic

Limitations and Improvement Directions

Current Limitations

1. LLM Bias

  • Bias exists toward Western, English-speaking, developed countries
  • May not adequately reflect consumption patterns of certain cultural regions

2. Limitations of Synthetic Personas

  • Not completely identical to actual human responses
  • Difficulty capturing subtle cultural nuances

3. Impact of Anchor Sentences

  • Results may vary depending on anchor sentence selection
  • Need domain-specific optimization

Improvement Directions

1. Validation with Real Data

  • Validate SSR results with small-scale actual surveys
  • Measure accuracy through A/B testing

2. Use Multiple Models

  • Reduce bias by ensembling results from multiple LLMs
  • Compare GPT-4, Claude, Gemini

3. Prompt Engineering

  • Improve prompts considering cultural context
  • Refine persona definitions

4. Continuous Monitoring

  • Regularly re-measure reliability
  • Immediately evaluate upon new content publication

Conclusion

Summary of Achievements

  • 100% success for 225 evaluations (8 minutes 24 seconds, about $3.50 cost)
  • Average revisit intent 3.078/5.0 → All content is positive
  • SSR methodology validated → High Test-Retest reliability (ICC 0.833)
  • Actionable insights derived → Content strategy formulation possible

Blog Operation Recommendations

  1. Expand Claude Code Series: Highest interest
  2. Strengthen AI Development Workflow Content: GraphRAG, Multi-Agent, Screenshot-to-Code
  3. Multilingual Support: Prioritize English, Korean, Japanese
  4. 4→5 Point Conversion Strategy: Deepen content quality (add hands-on examples, case studies)

Potential of SSR Methodology

SSR goes beyond being just a survey tool to become an innovative tool for content strategy formulation:

Applicable Areas:

  • Blog content revisit intent (this study)
  • Product purchase intent (original SSR purpose)
  • Service subscription intent
  • Ad click intent
  • Brand preference

Core Contributions:

  • ✓ Cost-efficient large-scale evaluation ($0.016 per evaluation)
  • ✓ Rapid iterative experiments (results in minutes)
  • ✓ Combination of quantitative evaluation + qualitative insights
  • ✓ High reliability (ICC 0.833)

The consumer research and content strategy fields are facing a new turning point in the AI era. Technologies like SSR will become powerful tools for blog operators and marketers to create better content and understand their audience more deeply.

References

Academic Papers

  • [arXiv 2510.08338] LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Implementation References

Blog Posts


📊 Code & Data: The complete code and data used in this analysis are available in the GitHub repository.

Read in Other Languages

Was this helpful?

Your support helps me create better content. Buy me a coffee! ☕

About the Author

JK

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.