Gemini Embedding 2 — How Multimodal Embeddings Change RAG

A deep dive into Google

Why Multimodal Embeddings

On March 10, 2026, Google announced Gemini Embedding 2 — described as “our first native multimodal embedding model.” It maps text, images, video, audio, and documents into a single vector space.

A major limitation of most existing RAG pipelines is that they embed only text. Even when an internal wiki contains diagrams, or a product manual includes screenshots, that content is ignored at the embedding stage. As a result, searches fail to surface relevant information even though it is present in the knowledge base.

Gemini Embedding 2 addresses this problem at its root.


Gemini Embedding 2 Key Specs

Input Modalities

| Modality | Coverage | Constraints |
|---|---|---|
| Text | Up to 8,192 tokens | 100+ languages supported |
| Images | Up to 6 per request | PNG, JPEG |
| Video | Up to 120 seconds | MP4, MOV |
| Audio | Native processing | No intermediate text conversion needed |
| Documents | Complex docs like PDFs | Mixed text + image processing |

Output Dimensions

The default output is a 3,072-dimensional vector. The key here is the application of Matryoshka Representation Learning (MRL). Like Russian nesting dolls, information is arranged in a nested structure: the most important information is concentrated in the leading dimensions, so truncating a vector to its first 1,536, 768, or 256 components preserves most of the signal.

3072 dimensions (highest precision)
 └── 1536 dimensions (high precision)
      └── 768 dimensions (general purpose)
           └── 256 dimensions (lightweight, mobile/edge)

The reason this matters in practice is that it allows flexible tuning of the cost-accuracy tradeoff. When indexing millions of documents, a two-stage strategy becomes feasible: first-pass filtering with 256 dimensions, then re-ranking top candidates with 3,072 dimensions.
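The two-stage idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a production design: the vectors are random stand-ins rather than real embeddings, stage 1 is a brute-force scan where a real system would use an ANN index, and the function names (`truncate_and_normalize`, `two_stage_search`) are made up for this example. Truncated vectors are re-normalized before scoring, since a prefix of a unit vector is no longer unit length.

```python
import numpy as np

def truncate_and_normalize(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of each row and rescale to unit length."""
    cut = vectors[:, :dims]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

def two_stage_search(query, corpus, coarse_dims=256, shortlist=50, top_k=5):
    """Shortlist with truncated vectors, then re-rank the shortlist at full size."""
    full_dims = corpus.shape[1]
    q_small = truncate_and_normalize(query[None, :], coarse_dims)[0]
    c_small = truncate_and_normalize(corpus, coarse_dims)
    q_full = truncate_and_normalize(query[None, :], full_dims)[0]

    # Stage 1: cheap cosine scores on the leading dimensions only.
    coarse_scores = c_small @ q_small
    n_candidates = min(shortlist, corpus.shape[0])
    candidates = np.argsort(-coarse_scores)[:n_candidates]

    # Stage 2: cosine scores at full dimensionality, for the shortlist only.
    c_full = truncate_and_normalize(corpus[candidates], full_dims)
    fine_scores = c_full @ q_full
    order = np.argsort(-fine_scores)[:top_k]
    return candidates[order], fine_scores[order]

# Random placeholders for 3,072-dimensional embeddings (assumption: not real model output).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 3072))
query = corpus[42] + 0.1 * rng.normal(size=3072)  # noisy copy of document 42

ids, scores = two_stage_search(query, corpus)
print(ids[0])  # document 42, the source of the noisy query, ranks first
```

The shortlist size (50 here) is the main tuning knob: a larger shortlist costs more full-dimension comparisons but lowers the chance that the coarse pass drops a relevant document.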

API Access

Two gateways are available:

  • Gemini API (AI Studio): For prototyping and individual developers. Includes a free tier.
  • Vertex AI (Google Cloud): Enterprise scale. VPC-SC, CMEK, and IAM integration.

Comparison with Existing Embedding Models

Single-Modal vs. Multimodal

graph TD
    subgraph Legacy["Legacy Pipeline"]
        T1["Text Embedding<br/>(text-embedding-3)"] --> VS1["Vector Store"]
        I1["Image Embedding<br/>(CLIP)"] --> VS2["Separate Vector Store"]
        A1["Audio<br/>(Whisper → Text)"] --> T1
    end
    subgraph New["Gemini Embedding 2"]
        T2["Text"] --> GE["Unified Embedding Model"]
        I2["Image"] --> GE
        V2["Video"] --> GE
        A2["Audio"] --> GE
        GE --> VS3["Single Vector Store"]
    end

Three problems with the legacy approach:

  1. Pipeline complexity: Each modality required a separate model, separate store, and separate retrieval logic
  2. No cross-modal search: Queries like “find code related to this diagram” were impossible
  3. Intermediate conversion loss: Converting audio to text lost nuance and context
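To make the "single vector store" idea concrete, here is a toy sketch of cross-modal search over one mixed collection. Everything in it is an assumption for illustration: the `Item` record, the file names, and the hand-made 3-dimensional vectors, which stand in for real embeddings from a shared vector space.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Item:
    item_id: str
    modality: str        # "text", "image", "audio", ...
    vector: np.ndarray   # embedding in the shared space

def search(query_vec: np.ndarray, items: list[Item], top_k: int = 3):
    """Rank every item by cosine similarity, regardless of modality."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for item in items:
        v = item.vector / np.linalg.norm(item.vector)
        scored.append((float(q @ v), item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hand-made placeholder vectors (assumption: real vectors would come from the model).
items = [
    Item("auth-service.py", "text", np.array([0.9, 0.1, 0.0])),
    Item("auth-flow-diagram.png", "image", np.array([0.8, 0.2, 0.1])),
    Item("q3-sales-call.mp3", "audio", np.array([0.0, 0.1, 0.9])),
]

query = np.array([0.85, 0.15, 0.05])  # pretend this is an embedded diagram
results = search(query, items, top_k=2)
for score, item in results:
    print(f"{item.modality:5s} {item.item_id} {score:.3f}")
```

Because code and diagram live in the same store, one query returns both; with per-modality stores the same lookup needs two searches plus a way to merge scores that are not directly comparable.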

Embedding Model Spec Comparison

| Model | Modalities | Max Dimensions | MRL | Price (per 1M tokens) |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | Text only | 3,072 | Yes | $0.13 |
| Cohere embed-v4 | Text + Image | 1,024 | Yes | $0.10 |
| Gemini Embedding 2 | Text + Image + Video + Audio | 3,072 | Yes | Free (preview) |
| Voyage AI voyage-3 | Text only | 1,024 | No | $0.06 |

Gemini Embedding 2’s differentiator is clear. Among the models compared here, it is the only one that natively supports all four modalities, it matches the largest output dimension, and it is currently free during the preview period.


Practical Application: Building a Multimodal RAG Pipeline

Architecture Design

graph TD
    subgraph Ingestion["Data Ingestion"]
        DOC["Internal Docs<br/>(PDF, Wiki)"]
        IMG["Images<br/>(Diagrams, Screenshots)"]
        VID["Meeting Recordings<br/>(MP4)"]
        AUD["Customer Calls<br/>(Voice)"]
    end
    subgraph Embedding["Embedding Processing"]
        DOC --> GE2["Gemini Embedding 2<br/>API"]
        IMG --> GE2
        VID --> GE2
        AUD --> GE2
    end
    subgraph Storage["Vector Storage"]
        GE2 --> PG["pgvector /<br/>Pinecone /<br/>Weaviate"]
    end
    subgraph Retrieval["Retrieval & Generation"]
        Q["User Query"] --> QE["Query Embedding"]
        QE --> PG
        PG --> RR["Re-ranking"]
        RR --> LLM["Gemini Pro /<br/>Claude"]
        LLM --> ANS["Response"]
    end

Code Example: Python SDK

import numpy as np
from google import genai
from google.genai import types

# Initialize client
client = genai.Client(api_key="YOUR_API_KEY")

# Model ID as given in this post; confirm the current Gemini Embedding 2
# identifier in the official model list before running.
MODEL = "gemini-embedding-exp-03-07"
DIMS = 768  # Reduce dimensions via MRL; must be the same for every modality

# Text embedding
text_result = client.models.embed_content(
    model=MODEL,
    contents=["Key clauses from the internal security policy document"],
    config={
        "output_dimensionality": DIMS,
        "task_type": "RETRIEVAL_DOCUMENT",
    },
)
print(f"Text vector dimensions: {len(text_result.embeddings[0].values)}")
# Output: Text vector dimensions: 768

# Image embedding (same vector space)
# Assumes the client has read access to this Cloud Storage URI.
image = types.Part.from_uri(
    file_uri="gs://my-bucket/architecture-diagram.png",
    mime_type="image/png",
)
image_result = client.models.embed_content(
    model=MODEL,
    contents=[image],
    # Request the same size as the text vector; otherwise the dot product
    # below fails on mismatched shapes (768 vs. the 3,072 default).
    config={"output_dimensionality": DIMS},
)

# Cosine similarity between text and image vectors is now possible
def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = cosine_similarity(
    text_result.embeddings[0].values,
    image_result.embeddings[0].values,
)
print(f"Text-image similarity: {similarity:.4f}")

Task Type Strategy

Gemini Embedding 2 lets you specify the embedding purpose using the task_type parameter:

| Task Type | Purpose | Use Case |
|---|---|---|
| RETRIEVAL_DOCUMENT | Document indexing | When storing RAG documents |
| RETRIEVAL_QUERY | Query encoding | When processing user search queries |
| SEMANTIC_SIMILARITY | Similarity comparison | Duplicate detection, clustering |
| CLASSIFICATION | Classification | Automatic document classification, labeling |
| CLUSTERING | Clustering | Topic modeling, grouping |

Pro tip: Always use different task types for indexing and retrieval. Using RETRIEVAL_DOCUMENT when storing documents and RETRIEVAL_QUERY when querying significantly improves asymmetric retrieval performance.
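One way to keep the two task types from getting mixed up is to derive the request config from the role of the call. The helper below is a small sketch under that assumption; `build_embed_config` and the role names `"index"` and `"query"` are invented for this example. The config keys mirror the ones used in the SDK example above.

```python
# Map the role of a call to the task type it should use.
TASK_TYPE_BY_ROLE = {
    "index": "RETRIEVAL_DOCUMENT",  # storing documents
    "query": "RETRIEVAL_QUERY",     # encoding user searches
}

def build_embed_config(role: str, dims: int = 768) -> dict:
    """Return an embed_content config dict for the given role."""
    if role not in TASK_TYPE_BY_ROLE:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(TASK_TYPE_BY_ROLE)}")
    return {
        "task_type": TASK_TYPE_BY_ROLE[role],
        "output_dimensionality": dims,  # keep identical for index and query
    }

index_config = build_embed_config("index")
query_config = build_embed_config("query")
print(index_config)
print(query_config)
```

Each dict would be passed as `config=` to `client.models.embed_content(...)`. Keeping `dims` in the same helper also guards against indexing at one dimensionality and querying at another.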


EM/CTO Perspective: Adoption Considerations

1. Pipeline Simplification = Reduced Operational Costs

The most direct benefit of adopting multimodal embeddings is reduced pipeline complexity.

If you currently operate separate embedding pipelines per modality:

  • 3–4 models → 1 model
  • 2–3 vector stores → 1 vector store
  • Synchronization logic eliminated
  • Fewer systems to monitor

According to Google’s official blog, some customers have achieved a 70% reduction in latency.

2. Vendor Dependency Assessment

Gemini Embedding 2 is currently Google-exclusive. For organizations running a multi-cloud strategy:

  • Abstract the embedding layer: Design the embedding model as a swappable interface
  • Vector format compatibility: 3,072-dimension vectors are compatible with most vector databases
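  • Leverage MRL: Dimension reduction lets you match another model’s vector size, so the index schema can stay the same. Vectors from different models are still not comparable, so switching models requires re-embedding the corpus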
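A swappable embedding layer might look like the following sketch. The `Embedder` protocol and the `HashEmbedder` stand-in are hypothetical names for this example; `HashEmbedder` produces deterministic pseudo-random vectors for tests and carries no semantic meaning. A real implementation would wrap the Gemini client (or any other provider) behind the same two methods.

```python
import hashlib
from typing import Protocol, Sequence

import numpy as np

class Embedder(Protocol):
    """Minimal interface the rest of the pipeline depends on."""
    dims: int
    def embed_documents(self, texts: Sequence[str]) -> np.ndarray: ...
    def embed_query(self, text: str) -> np.ndarray: ...

class HashEmbedder:
    """Deterministic test double; NOT a semantic model."""
    def __init__(self, dims: int = 8):
        self.dims = dims

    def _embed(self, text: str) -> np.ndarray:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        vec = np.random.default_rng(seed).normal(size=self.dims)
        return vec / np.linalg.norm(vec)

    def embed_documents(self, texts: Sequence[str]) -> np.ndarray:
        return np.stack([self._embed(t) for t in texts])

    def embed_query(self, text: str) -> np.ndarray:
        return self._embed(text)

def top_match(embedder: Embedder, docs: Sequence[str], query: str) -> str:
    """Return the document whose vector is closest to the query vector."""
    index = embedder.embed_documents(docs)
    scores = index @ embedder.embed_query(query)
    return docs[int(np.argmax(scores))]

docs = ["reset password", "export invoices", "rotate API keys"]
match = top_match(HashEmbedder(), docs, "reset password")
print(match)
```

Since `top_match` only sees the `Embedder` interface, replacing the provider is a one-line change at the call site, plus a re-embedding job for the stored corpus.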

3. Data Governance

Sending multimodal data to an external API introduces governance considerations:

  • On Vertex AI, VPC Service Controls can define data perimeters
  • CMEK (Customer-Managed Encryption Keys) is supported
  • PII masking is recommended before embedding meeting recordings and customer call audio
  • If Data Residency requirements apply, confirm region selection carefully

4. Cost Model Forecasting

The model is currently free during preview, but billing is expected after GA. Cost optimization strategy:

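Indexing:     256 dimensions (MRL)    → ~92% smaller ANN index vs. 3072 (256/3072 ≈ 8%)
First pass:   256-dim ANN search      → fast and inexpensive
Re-ranking:   3072-dim exact scoring  → top 50 candidates only

This two-stage strategy balances cost and accuracy at the scale of millions of documents. The full 3,072-dimension vectors still have to be kept somewhere (for example, in cheaper non-indexed storage) so they are available for re-ranking.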
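The storage figure is easy to check with back-of-the-envelope arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per component) and ignores index overhead and metadata, so treat the numbers as a lower bound on real usage. For raw vectors, going from 3,072 to 256 dimensions works out to a reduction of roughly 92%.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_gb(num_vectors: int, dims: int) -> float:
    """Raw storage for float32 vectors in decimal gigabytes, excluding index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9

n = 1_000_000
full = raw_vector_gb(n, 3072)
small = raw_vector_gb(n, 256)
reduction = 1 - small / full

print(f"{full:.3f} GB vs {small:.3f} GB -> {reduction:.1%} smaller")
# prints "12.288 GB vs 1.024 GB -> 91.7% smaller"
```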


Production Migration Checklist

When transitioning from a text-only RAG to a multimodal RAG:

  1. Data inventory: Assess the current state of non-text data in your organization (images, video, audio)
  2. Prioritization: Apply multimodal indexing first to document types with the highest search failure rate
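  3. Vector DB compatibility: Confirm your existing vector store can both store and index 3,072 dimensions. Pinecone and Weaviate handle vectors of this size. pgvector can store them, but its HNSW and IVFFlat indexes on the vector type are limited to 2,000 dimensions, so use halfvec or a reduced MRL dimension there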
  4. A/B testing: Quantitatively compare retrieval accuracy between existing text-only and multimodal embeddings
  5. Monitoring: Track cross-modal search rate, latency, and embedding API call volume
  6. Security review: Obtain security and compliance approval for external transmission of multimodal data
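For the A/B test in step 4, a common starting metric is recall@k over a set of queries with known relevant items. The sketch below uses made-up query IDs, document IDs, and rankings purely to show the calculation; the function names are illustrative, and real runs would come from your two pipelines.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs: dict[str, list[str]], judgments: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every judged query."""
    per_query = [recall_at_k(runs.get(q, []), rel, k) for q, rel in judgments.items()]
    return sum(per_query) / len(per_query)

# Made-up relevance judgments: q1 has a relevant diagram that text-only search cannot see.
judgments = {"q1": {"doc-a", "img-7"}, "q2": {"doc-b"}}
text_only = {"q1": ["doc-a", "doc-c", "doc-d"], "q2": ["doc-e", "doc-b", "doc-f"]}
multimodal = {"q1": ["img-7", "doc-a", "doc-c"], "q2": ["doc-b", "doc-e", "doc-f"]}

baseline_score = mean_recall_at_k(text_only, judgments, k=2)
candidate_score = mean_recall_at_k(multimodal, judgments, k=2)
print(f"text-only recall@2:  {baseline_score:.2f}")   # 0.75
print(f"multimodal recall@2: {candidate_score:.2f}")  # 1.00
```

Including queries whose answers live in images, video, or audio is what makes the comparison meaningful; a test set built only from text-answerable questions will understate the difference.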

Conclusion

Gemini Embedding 2 is not simply “a new embedding model.” It is a paradigm shift in RAG pipeline architecture.

Search systems that could previously only handle text can now perform unified retrieval across images, video, and audio within the same vector space. This is not just a technical advancement — it is a change that can fundamentally transform how organizations leverage unstructured data.

Key action items from an Engineering Manager’s perspective:

  1. Now: Run a PoC using the Gemini API during the preview period (free)
  2. Within 1–2 weeks: Create an inventory of non-text data within your organization
  3. Within 1 month: Design an A/B test comparing multimodal RAG against your existing RAG pipeline

About the Author

Kim Jangwook

Full-Stack Developer specializing in AI/LLM

Building AI agent systems, LLM applications, and automation solutions with 10+ years of web development experience. Sharing practical insights on Claude Code, MCP, and RAG systems.