DeNA LLM Study Part 4: RAG Architecture and Latest Trends

Exploring RAG from core concepts to GraphRAG and Agentic RAG through DeNA LLM Study Series Part 4, covering cutting-edge retrieval-augmented generation techniques.

Series: DeNA LLM Study (4/5)

  1. Part 1: LLM Fundamentals and 2025 AI Landscape
  2. Part 2: Structured Output and Multi-LLM Pipelines
  3. Part 3: Model Training Methodologies
  4. Part 4: RAG Architecture and Latest Trends ← Current Article
  5. Part 5: Agent Design and Multi-Agent Orchestration

Overview

DeNA’s LLM Study Series Part 4 covers RAG (Retrieval-Augmented Generation) from core concepts to the latest trends. We’ve learned how to design systems that effectively leverage external knowledge beyond simple prompt engineering.

This article organizes the complete RAG architecture, hybrid search strategies, reranking techniques, and cutting-edge developments like GraphRAG and Agentic RAG based on DeNA’s study materials.

Context Engineering: LLM as Interface

The core concept emphasized in DeNA’s study is that “LLM is just an interface; the retrieval system is the real core”.

Beyond Prompt Engineering

Traditional prompt engineering focused on giving better instructions to LLMs. However, in RAG systems:

  • Retrieval quality determines response quality
  • Context selection prevents hallucination
  • System design optimizes performance and cost
graph TD
    User[User Query] --> Query[Query Analysis]
    Query --> Search[Retrieval System]
    Search --> Retrieve[Document Search]
    Retrieve --> Rerank[Reranking]
    Rerank --> Context[Context Assembly]
    Context --> LLM[LLM Generation]
    LLM --> Response[Response]

    style Search fill:#e1f5ff
    style Rerank fill:#fff4e1
    style LLM fill:#ffe1f5

Core Value of RAG

  1. Latest Information: Access to post-training knowledge
  2. Domain Knowledge Integration: Leverage internal documents and specialized knowledge
  3. Hallucination Prevention: Generate grounded responses
  4. Traceability: Build trust through citation

Complete RAG Architecture

DeNA’s study divides RAG into five stages:

1. Document Indexing

graph LR
    Doc[Source Docs] --> Chunk[Chunking]
    Chunk --> Embed[Embedding]
    Embed --> Store[Vector Store]

    style Chunk fill:#e1f5ff
    style Embed fill:#fff4e1

Chunking Strategies:

  • Fixed Size: Split by 512 token units
  • Semantic-based: Split by paragraphs or sections
  • Overlap: 50-100 token overlap for context preservation
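
The fixed-size strategy with overlap can be sketched in a few lines; this uses a whitespace tokenizer purely for illustration (real pipelines chunk with the embedding model's own tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # Slide a window of chunk_size tokens, stepping by chunk_size - overlap
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means the tail of each chunk reappears at the head of the next, so sentences that straddle a boundary remain intact in at least one chunk.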

Embedding Selection:

  • OpenAI text-embedding-3: Versatility, API convenience
  • Cohere Embed v3: Multilingual support, compressed embeddings
  • BGE Series: Open-source, customizable

2. Query Expansion

Techniques to enrich short user queries:

# HyDE (Hypothetical Document Embeddings)
query = "How to improve RAG performance?"

# 1. LLM generates hypothetical answer
hypothetical_answer = llm.generate(f"""
Write a detailed answer to this question:
{query}
""")

# 2. Embed and search using hypothetical answer
embedding = embed_model.encode(hypothetical_answer)
results = vector_store.search(embedding, top_k=5)

Query Expansion Techniques:

  • HyDE: Generate hypothetical document, then search
  • Multi-Query: Generate queries from multiple perspectives
  • Query Decomposition: Break complex queries into sub-queries
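
A minimal Multi-Query sketch, assuming the same `llm` and `retriever` interfaces used in the surrounding examples (hypothetical names), with results de-duplicated by document id:

```python
def multi_query_search(query, llm, retriever, n_queries=3, top_k=5):
    # 1. Ask the LLM for paraphrases from different perspectives
    prompt = (
        f"Rewrite the following question from {n_queries} different "
        f"perspectives, one per line:\n{query}"
    )
    variants = [q.strip() for q in llm.generate(prompt).splitlines() if q.strip()]

    # 2. Search with the original query plus each variant,
    #    keeping the first hit per document id
    seen = {}
    for q in [query] + variants:
        for doc in retriever.search(q, top_k=top_k):
            seen.setdefault(doc.id, doc)
    return list(seen.values())
```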

3. Hybrid Search

DeNA study’s core emphasis: a BM25 + Dense + Sparse combination

# Hybrid search (conceptual sketch: assumes the three score vectors are
# aligned over a shared candidate set and normalized to comparable ranges)
def hybrid_search(query, alpha=0.5):
    # 1. BM25 (keyword-based)
    bm25_scores = bm25_retriever.search(query, top_k=20)

    # 2. Dense Vector (semantic-based)
    dense_embedding = dense_model.encode(query)
    dense_scores = vector_store.search(dense_embedding, top_k=20)

    # 3. Sparse Vector (important token-based)
    sparse_embedding = splade_model.encode(query)
    sparse_scores = sparse_store.search(sparse_embedding, top_k=20)

    # 4. Score combination (weighted average)
    combined_scores = (
        alpha * bm25_scores +
        (1 - alpha) * 0.7 * dense_scores +
        (1 - alpha) * 0.3 * sparse_scores
    )

    return combined_scores.top_k(10)

Characteristics of Each Method:

| Method | Strengths | Weaknesses |
|--------|-----------|------------|
| BM25   | Precise keyword matching, fast | Lacks semantic understanding |
| Dense  | Captures semantic similarity   | Can miss exact keywords |
| Sparse | Emphasizes important tokens    | Higher computational cost |
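
A practical alternative to the weighted average above is Reciprocal Rank Fusion (RRF), which combines ranked lists without having to normalize their heterogeneous scores; a self-contained sketch over lists of document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    # Each list contributes 1 / (k + rank) per document; k=60 is the
    # constant used in the original RRF paper
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in fused[:top_n]]
```

For example, `reciprocal_rank_fusion([bm25_ids, dense_ids, sparse_ids])` fuses the id lists from each retriever; documents ranked highly by several methods float to the top.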

4. Reranking

Stage for more precise reordering of retrieved documents:

graph LR
    Initial[Initial Search<br/>100 docs] --> FirstFilter[1st Filter<br/>20 docs]
    FirstFilter --> Rerank[Reranker<br/>Precise Scoring]
    Rerank --> Final[Final Selection<br/>5 docs]

    style Rerank fill:#ffe1f5

Reranking Model Selection:

  1. ColBERT (Late Interaction)

    • Token-level similarity between query and document
    • Balance of speed and accuracy
    • Advantage: Fast and effective
  2. Cross-Encoder

    • Encodes query and document together
    • Highest accuracy
    • Disadvantage: Slow (requires computing all pairs)
  3. BGE-reranker

    • Open-source Cross-Encoder
    • Multilingual support
    • Widely used in practice
# BGE-reranker usage example
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# Rerank retrieved documents
pairs = [[query, doc.text] for doc in retrieved_docs]
scores = reranker.compute_score(pairs)

# Sort by score
reranked_docs = sorted(
    zip(retrieved_docs, scores),
    key=lambda x: x[1],
    reverse=True
)[:5]
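
The late-interaction scoring that gives ColBERT its speed/accuracy balance can be illustrated with a toy MaxSim computation in NumPy (a sketch of the scoring rule only, not the actual ColBERT model):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # For each query token vector, take its best match over all document
    # token vectors, then sum over query tokens. Rows are assumed
    # L2-normalized so the dot products are cosine similarities.
    sim = query_vecs @ doc_vecs.T        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())
```

Because document token vectors can be precomputed offline, only the cheap dot products and max remain at query time, unlike a Cross-Encoder which must re-encode every query-document pair.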

5. Generation

Generate LLM response with finally selected context:

def generate_with_citations(query, top_docs):
    # Assemble context
    context = "\n\n".join([
        f"[{i+1}] {doc.text}\nSource: {doc.source}"
        for i, doc in enumerate(top_docs)
    ])

    prompt = f"""
Answer the question based on the following context.
Always cite source numbers (e.g., [1], [2]).

Context:
{context}

Question: {query}

Answer:"""

    response = llm.generate(prompt)
    return response

Embedding Model Comparison

Major embedding models covered in DeNA’s study:

OpenAI text-embedding-3

from openai import OpenAI

client = OpenAI()

# Small model (cheap, fast)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your text here"
)
embedding_small = response.data[0].embedding  # 1536 dimensions

# Large model (high quality)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here",
    dimensions=3072  # max 3072 dimensions
)
embedding_large = response.data[0].embedding

Features:

  • API-based for ease of use
  • High versatility
  • Cost-efficient (small: $0.02/1M tokens)

Cohere Embed v3

import cohere

co = cohere.Client('your-api-key')

# Multilingual embeddings
response = co.embed(
    texts=["Korean text", "English text", "日本語テキスト"],
    model="embed-multilingual-v3.0",
    input_type="search_query"  # or "search_document"
)

embeddings = response.embeddings  # 1024 dimensions

Features:

  • 100+ language support
  • Optimized by input type (query vs document)
  • Compressed embedding support (128-1024 dimensions)

BGE (Beijing Academy of AI) Series

from FlagEmbedding import BGEM3FlagModel

# BGE-M3: multi-functional embedding (the multi-vector features are
# exposed via BGEM3FlagModel, not the plain FlagModel)
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# encode() returns a dict; choose which representations to compute
output = model.encode(
    ["Query text"],
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True
)

dense_vecs = output['dense_vecs']           # dense embedding
sparse_weights = output['lexical_weights']  # sparse (SPLADE-like) token weights
colbert_vecs = output['colbert_vecs']       # ColBERT-style multi-vector

Features:

  • Open-source (commercial use allowed)
  • Supports all 3 retrieval methods (Dense, Sparse, Multi-vector)
  • Long context support (up to 8192 tokens)
  • 100+ language support

Grounding: Hallucination Prevention Strategies

One of RAG’s most important goals is preventing hallucination.

1. Citation Enforcement

system_prompt = """
You are an AI that answers using only the provided context.
Follow these rules strictly:

1. Cite source numbers for all claims [1], [2], etc.
2. State "Not in provided information" for absent information
3. Express "Uncertain" when not confident
4. State facts only, do not speculate
"""

2. Uncertainty Expression

def generate_with_confidence(query, context):
    prompt = f"""
Context: {context}

Question: {query}

Answer in the following format:
- Answer: [Your answer]
- Confidence: [High/Medium/Low]
- Evidence: [Quote relevant parts from context]
"""
    return llm.generate(prompt)

3. Self-RAG: Self-Verification

Self-RAG allows LLMs to judge retrieval necessity and verify responses autonomously.

graph TD
    Query[User Query] --> NeedRetrieval{Need Retrieval?}
    NeedRetrieval -->|Yes| Retrieve[Document Search]
    NeedRetrieval -->|No| DirectGen[Direct Generation]
    Retrieve --> Generate[Generate Answer]
    Generate --> Verify{Verify}
    Verify -->|Supported| Output[Output]
    Verify -->|Insufficient| Retrieve
    DirectGen --> Output

    style Verify fill:#ffe1f5
def self_rag(query, max_depth=3):
    # 1. Judge retrieval necessity
    need_retrieval = llm.classify(
        f"Does this question need external information? {query}"
    )

    if need_retrieval:
        # 2. Retrieve documents
        docs = retriever.search(query)

        # 3. Generate answer
        response = llm.generate_with_context(query, docs)

        # 4. Verify answer
        is_supported = llm.verify(
            f"Is this answer sufficiently supported by the context? Answer: {response}"
        )

        if not is_supported and max_depth > 1:
            # Re-retrieve, bounded by max_depth to avoid infinite recursion
            return self_rag(query, max_depth - 1)
    else:
        response = llm.generate(query)

    return response

Latest RAG Trends

Latest RAG development directions covered in DeNA’s study:

1. GraphRAG

Knowledge graph-based RAG released by Microsoft in 2024:

graph TD
    Docs[Source Docs] --> Extract[Entity/Relation<br/>Extraction]
    Extract --> Graph[Knowledge Graph<br/>Construction]
    Graph --> Community[Community<br/>Detection]
    Community --> Summary[Hierarchical<br/>Summarization]

    Query[User Query] --> GraphSearch[Graph Search]
    GraphSearch --> Summary
    Summary --> LLM[Generate Answer]

    style Graph fill:#e1f5ff
    style Community fill:#fff4e1

GraphRAG Advantages:

  • Relation-based Reasoning: Leverage connections between entities
  • Multi-hop Reasoning: Handle queries that chain relations across entities (e.g., “who manages the person that A knows?”)
  • Holistic Context: Understand connections across documents

Use Cases:

  • Organization chart-based Q&A
  • Legal document precedent references
  • Academic paper citation analysis
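
The multi-hop idea can be made concrete with a toy entity graph; this sketch uses plain dictionaries (GraphRAG itself builds the graph from LLM-extracted entities and relations):

```python
# Toy knowledge graph: entity -> list of (relation, target) edges
GRAPH = {
    "Alice": [("works_at", "DeNA"), ("knows", "Bob")],
    "Bob": [("manages", "Carol")],
    "DeNA": [("located_in", "Tokyo")],
}

def multi_hop(start, relations, graph=GRAPH):
    # Follow a chain of relations, e.g. "who does the person Alice
    # knows manage?" -> multi_hop("Alice", ["knows", "manages"])
    frontier = [start]
    for rel in relations:
        frontier = [
            dst
            for node in frontier
            for (r, dst) in graph.get(node, [])
            if r == rel
        ]
    return frontier
```

A chunk-based retriever would need both facts to land in the same context window; the graph answers the chained question by traversal alone.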

2. Agentic RAG

Autonomous RAG that has emerged in the recent “era of agents”:

graph TD
    Query[User Query] --> Agent[RAG Agent]
    Agent --> Plan[Planning]
    Plan --> Tool1[Tool 1:<br/>Vector Search]
    Plan --> Tool2[Tool 2:<br/>Keyword Search]
    Plan --> Tool3[Tool 3:<br/>Web Search]

    Tool1 --> Eval{Evaluate}
    Tool2 --> Eval
    Tool3 --> Eval

    Eval -->|Insufficient| Agent
    Eval -->|Sufficient| Generate[Generate Answer]

    style Agent fill:#ffe1f5
    style Eval fill:#fff4e1

Traditional RAG vs Agentic RAG:

| Traditional RAG | Agentic RAG |
|-----------------|-------------|
| Single retrieval step | Iterative retrieval |
| Fixed pipeline | Dynamic tool selection |
| Reactive to user query | Planning and execution |
| Terminates on retrieval failure | Retries with a changed strategy |

Implementation Example (LangGraph):

from langgraph.graph import StateGraph, END
from langchain.tools import Tool

# Define RAG agent
class RAGAgent:
    def __init__(self, llm):
        self.llm = llm  # planning/generation model
        self.tools = [
            Tool(name="vector_search", func=self.vector_search,
                 description="Semantic search over the vector store"),
            Tool(name="keyword_search", func=self.keyword_search,
                 description="BM25 keyword search"),
            Tool(name="web_search", func=self.web_search,
                 description="External web search")
        ]

    def plan(self, query):
        # LLM determines retrieval strategy
        plan = self.llm.generate(f"""
        Plan a retrieval strategy to answer this question:
        {query}

        Available tools: {[tool.name for tool in self.tools]}
        """)
        return plan

    def execute(self, query):
        max_iterations = 3
        context = []

        for i in range(max_iterations):
            # Create plan
            plan = self.plan(query)

            # Execute tools
            results = self.execute_tools(plan)
            context.extend(results)

            # Evaluate if information is sufficient
            is_sufficient = self.evaluate(query, context)

            if is_sufficient:
                break

        # Generate final response
        return self.generate_response(query, context)

3. Long RAG

RAG variant for long context processing:

Problem: Traditional RAG operates within limited context windows (4K-8K tokens)

Solutions:

  1. Hierarchical Retrieval: Narrow down from chapter → section → paragraph
  2. Streaming Context: Load only necessary parts sequentially
  3. Summary-based Retrieval: Search long documents via summaries first
def long_rag(query, long_documents):
    # Stage 1: Select candidates via document summaries
    summaries = [doc.summary for doc in long_documents]
    candidate_docs = vector_search(query, summaries, top_k=3)

    # Stage 2: Detailed search within selected documents
    detailed_chunks = []
    for doc in candidate_docs:
        chunks = chunk_document(doc, chunk_size=512)
        relevant_chunks = vector_search(query, chunks, top_k=5)
        detailed_chunks.extend(relevant_chunks)

    # Stage 3: Generate answer with final context
    return generate_response(query, detailed_chunks)

4. Multimodal RAG (ColPali)

Retrieve not just text, but images, tables, and diagrams:

ColPali: Embed entire document pages as images

# Illustrative sketch: the real colpali-engine package exposes a
# different API (models loaded via from_pretrained with a processor)
from colpali import ColPali

# Embed document page images
model = ColPali()
page_embeddings = model.encode_images([
    "doc1_page1.png",
    "doc1_page2.png",
    "doc2_page1.png"
])

# Search images with text query
query_embedding = model.encode_text("What is the net profit in the financial statement?")
similar_pages = vector_search(query_embedding, page_embeddings)

# Pass retrieved page images to Vision LLM
response = vision_llm.generate_with_image(
    query="What is the net profit in the financial statement?",
    images=similar_pages
)

Multimodal RAG Advantages:

  • Layout Preservation: Maintain original PDF tables and charts
  • No OCR Needed: Process images directly
  • Visual Context: Leverage diagrams and graphs

Practical Application Insights

Insights for real-world application from DeNA’s study:

1. Incremental Optimization Strategy

graph LR
    Basic[Basic RAG<br/>Dense Only] --> Hybrid[Hybrid<br/>+ BM25]
    Hybrid --> Rerank[Reranking<br/>+ BGE-reranker]
    Rerank --> Advanced[Advanced<br/>+ GraphRAG]

    style Basic fill:#e1f5ff
    style Hybrid fill:#fff4e1
    style Rerank fill:#ffe1f5
    style Advanced fill:#e1ffe1
  1. Stage 1: Basic RAG

    • Use dense vector search only
    • Build rapid prototype
    • Measure baseline performance
  2. Stage 2: Hybrid Search

    • Add BM25 for keyword matching
    • Expect 10-20% performance gain
  3. Stage 3: Reranking

    • Improve precision with BGE-reranker
    • Additional 15-25% improvement
  4. Stage 4: Advanced Techniques

    • Apply GraphRAG, Agentic RAG per domain
    • Enhance complex question handling

2. Evaluation Metrics Setup

Key metrics for measuring RAG system performance:

# Retrieval quality metrics
def evaluate_retrieval(queries, ground_truth):
    metrics = {
        'recall@k': [],  # share of correct docs retrieved in top k
        'mrr': []        # Mean Reciprocal Rank
    }

    for query, truth in zip(queries, ground_truth):
        retrieved = retriever.search(query, top_k=10)

        # Recall@10
        recall = len(set(retrieved) & set(truth)) / len(truth)
        metrics['recall@k'].append(recall)

        # MRR (0 when no correct doc is retrieved)
        rr = 0.0
        for i, doc in enumerate(retrieved):
            if doc in truth:
                rr = 1 / (i + 1)
                break
        metrics['mrr'].append(rr)

    return {k: sum(v) / len(v) for k, v in metrics.items()}

# Generation quality metrics (ragas expects an evaluation dataset of
# questions, contexts, answers, and references; see the ragas docs)
def evaluate_generation(dataset):
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, faithfulness, context_precision

    return evaluate(
        dataset,
        metrics=[answer_relevancy, faithfulness, context_precision]
    )

Target Values (DeNA study recommendations):

  • Recall@10: 0.8+ (retrieve 80%+ of correct documents)
  • MRR: 0.6+ (correct answer in top 2 on average)
  • Answer Relevancy: 0.9+
  • Faithfulness: 0.95+ (minimize hallucination)

3. Cost Optimization

RAG system cost structure:

graph TD
    Cost[Total Cost] --> Embed[Embedding Cost]
    Cost --> Storage[Storage Cost]
    Cost --> Compute[Compute Cost]
    Cost --> LLM[LLM Cost]

    Embed --> EmbedOpt[Batch Processing<br/>Caching]
    Storage --> StorageOpt[Compressed Embedding<br/>Index Optimization]
    Compute --> ComputeOpt[Efficient Search<br/>Selective Reranking]
    LLM --> LLMOpt[Context Compression<br/>Smaller Models]

    style EmbedOpt fill:#e1f5ff
    style StorageOpt fill:#fff4e1
    style ComputeOpt fill:#ffe1f5
    style LLMOpt fill:#e1ffe1

Cost Reduction Strategies:

  1. Embedding Optimization

    # Reduce API calls with batch processing
    batch_size = 100
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings.extend(embed_model.encode(batch))
    
    # Cache embeddings
    import pickle
    with open('embeddings_cache.pkl', 'wb') as f:
        pickle.dump(embeddings, f)
  2. Context Compression

    def compress_context(docs, query, max_tokens=2000):
        # Extract only important sentences
        sentences = extract_sentences(docs)
        scores = compute_relevance(sentences, query)
    
        # Select most relevant sentences within token limit
        selected = []
        total_tokens = 0
        for sent, score in sorted(zip(sentences, scores),
                                   key=lambda x: x[1],
                                   reverse=True):
            sent_tokens = count_tokens(sent)
            if total_tokens + sent_tokens <= max_tokens:
                selected.append(sent)
                total_tokens += sent_tokens
    
        return " ".join(selected)
  3. Caching Strategy

    from functools import lru_cache
    
    @lru_cache(maxsize=1000)
    def cached_retrieval(query: str):
        # Cache on the query string itself; lru_cache hashes the argument
        return retriever.search(query)
    
    # Usage
    results = cached_retrieval(query)

4. Security and Privacy

Security considerations for RAG systems:

Data Isolation:

def secure_rag(query, user_id):
    # Verify user document access permissions
    allowed_docs = get_user_documents(user_id)

    # Search filtered vector store
    results = vector_store.search(
        query,
        filter={"doc_id": {"$in": allowed_docs}}
    )

    return results

Sensitive Information Filtering:

import re

def sanitize_response(response):
    # Remove personal information patterns
    patterns = {
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'phone': r'\b\d{3}[-.]?\d{3,4}[-.]?\d{4}\b',
        'ssn': r'\b\d{3}[-]?\d{2}[-]?\d{4}\b'
    }

    for name, pattern in patterns.items():
        response = re.sub(pattern, f'[{name.upper()}_REDACTED]', response)

    return response

Reflections and Next Steps

Through DeNA’s LLM Study Part 4, I’ve gained a deep understanding that RAG is not simply “retrieve and generate,” but an engineering domain requiring sophisticated system design.

Key Insights

  1. Retrieval is Core: LLM is interface; real value lies in accurate context retrieval
  2. Hybrid Approach: Combine Dense, Sparse, BM25 for best performance
  3. Reranking Essential: Precisely filter initial search results
  4. Hallucination Prevention: Ensure reliability through citation, verification, Self-RAG
  5. Evolving Paradigm: Continuous evolution toward GraphRAG, Agentic RAG

Practical Application Plans

Based on this learning, I’m planning the following improvements:

  1. Introduce Hybrid Search

    • Currently using Dense vectors only → Add BM25
    • Apply BGE-M3 model for multi-functional embedding
  2. Build Reranking Pipeline

    • Integrate BGE-reranker-large
    • Two-stage search (100 → 20 → 5 documents)
  3. Establish Evaluation Framework

    • Measure Recall@10, MRR, NDCG
    • Build A/B testing framework
  4. Experiment with Agentic RAG

    • Dynamic search strategy with LangGraph
    • Handle complex multi-step questions

Next Learning: Part 5 - Production Deployment and Monitoring

In the final Part 5 of DeNA’s study series, we’ll cover:

  • Production deployment strategies for LLM systems
  • Performance monitoring and logging
  • A/B testing and continuous improvement
  • Cost optimization and scalability



Series Continues: Part 5: Agent Design and Multi-Agent Orchestration


About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.