Gemini Embedding 2 — How Multimodal Embeddings Change RAG

Why Multimodal Embeddings

On March 10, 2026, Google announced Gemini Embedding 2 — described as “our first native multimodal embedding model.” It maps text, images, video, audio, and documents into a single vector space.

The biggest limitation of existing RAG pipelines was that they could only handle text. Even when an internal wiki contained diagrams, or a product manual included screenshots, all of that was ignored at the embedding stage. As a result, searches repeatedly failed to surface relevant information despite it being present in the knowledge base.

Gemini Embedding 2 addresses this problem at its root.

Gemini Embedding 2 Key Specs

Input Modalities

Modality	Coverage	Constraints
Text	Up to 8,192 tokens	100+ languages supported
Images	Up to 6 per request	PNG, JPEG
Video	Up to 120 seconds	MP4, MOV
Audio	Native processing	No intermediate text conversion needed
Documents	Complex docs like PDFs	Mixed text + image processing
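
The limits in the table can be checked on the client before a request is sent. The following is a minimal sketch under that assumption: `EmbedRequest` and `validate` are hypothetical names rather than SDK features, and the constants simply restate the table, so they should be re-checked against the current documentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Limits restated from the table above (a snapshot, not an authoritative source).
MAX_TEXT_TOKENS = 8192
MAX_IMAGES_PER_REQUEST = 6
MAX_VIDEO_SECONDS = 120
IMAGE_MIME_TYPES = {"image/png", "image/jpeg"}
VIDEO_MIME_TYPES = {"video/mp4", "video/quicktime"}  # MP4, MOV


@dataclass
class EmbedRequest:
    """Hypothetical description of one embedding request."""
    text_tokens: int = 0
    image_mime_types: Tuple[str, ...] = ()
    video_seconds: float = 0.0
    video_mime_type: Optional[str] = None


def validate(request: EmbedRequest) -> list:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if request.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text has {request.text_tokens} tokens (limit {MAX_TEXT_TOKENS})")
    if len(request.image_mime_types) > MAX_IMAGES_PER_REQUEST:
        problems.append(
            f"{len(request.image_mime_types)} images (limit {MAX_IMAGES_PER_REQUEST})"
        )
    for mime in request.image_mime_types:
        if mime not in IMAGE_MIME_TYPES:
            problems.append(f"unsupported image type: {mime}")
    if request.video_mime_type is not None:
        if request.video_mime_type not in VIDEO_MIME_TYPES:
            problems.append(f"unsupported video type: {request.video_mime_type}")
        if request.video_seconds > MAX_VIDEO_SECONDS:
            problems.append(f"video is {request.video_seconds}s (limit {MAX_VIDEO_SECONDS}s)")
    return problems


too_big = EmbedRequest(text_tokens=9000, image_mime_types=("image/png",) * 7)
print(validate(too_big))
```

Returning a list of problems instead of raising on the first one makes it easier to log every reason an asset was skipped during bulk ingestion.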

Output Dimensions

The default output is a 3,072-dimensional vector. The key feature here is Matryoshka Representation Learning (MRL). Like Russian nesting dolls, information is arranged in a nested structure: the most important information is concentrated in the leading dimensions, so a vector truncated to its first 1,536, 768, or 256 components still preserves the core meaning.

3072 dimensions (highest precision)
 └── 1536 dimensions (high precision)
      └── 768 dimensions (general purpose)
           └── 256 dimensions (lightweight, mobile/edge)

The reason this matters in practice is that it allows flexible tuning of the cost-accuracy tradeoff. When indexing millions of documents, a two-stage strategy becomes feasible: first-pass filtering with 256 dimensions, then re-ranking top candidates with 3,072 dimensions.
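
To make the truncation idea concrete, here is a minimal sketch of how an MRL vector might be shortened on the client side. It uses random NumPy data in place of a real embedding, and `truncate_embedding` is a hypothetical helper, not part of any SDK. The re-normalization step is an assumption that usually holds: a prefix of a unit-length vector is generally no longer unit length.

```python
import numpy as np


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(head)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return head / norm


# Stand-in for a 3,072-dim embedding (random values, for illustration only).
rng = np.random.default_rng(0)
full = rng.standard_normal(3072).astype(np.float32)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape, float(np.linalg.norm(small)))
```

Because the smaller vector is just a prefix of the larger one, a single stored 3,072-dimension embedding can serve several index sizes without another API call.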

API Access

Two gateways are available:

Gemini API (AI Studio): For prototyping and individual developers. Includes a free tier.
Vertex AI (Google Cloud): Enterprise scale. VPC-SC, CMEK, and IAM integration.

Comparison with Existing Embedding Models

Single-Modal vs. Multimodal

graph TD
    subgraph Legacy["Legacy Pipeline"]
        T1["Text Embedding<br/>(text-embedding-3)"] --> VS1["Vector Store"]
        I1["Image Embedding<br/>(CLIP)"] --> VS2["Separate Vector Store"]
        A1["Audio<br/>(Whisper → Text)"] --> T1
    end
    subgraph New["Gemini Embedding 2"]
        T2["Text"] --> GE["Unified Embedding Model"]
        I2["Image"] --> GE
        V2["Video"] --> GE
        A2["Audio"] --> GE
        GE --> VS3["Single Vector Store"]
    end

Three problems with the legacy approach:

Pipeline complexity: Each modality required a separate model, separate store, and separate retrieval logic
No cross-modal search: Queries like “find code related to this diagram” were impossible
Intermediate conversion loss: Converting audio to text lost nuance and context

Embedding Model Spec Comparison

Model	Modalities	Max Dimensions	MRL	Price (per 1M tokens)
OpenAI text-embedding-3-large	Text only	3,072	Yes	$0.13
Cohere embed-v4	Text + Image	1,536	Yes	$0.10
Gemini Embedding 2	Text + Image + Video + Audio	3,072	Yes	Free (preview)
Voyage AI voyage-3	Text only	1,024	No	$0.06

Gemini Embedding 2’s differentiator is clear. Among the models compared here, it is the only one that natively supports all four modalities, it matches the largest output dimension in the table, and it is currently free during the preview period.

Practical Application: Building a Multimodal RAG Pipeline

Architecture Design

graph TD
    subgraph Ingestion["Data Ingestion"]
        DOC["Internal Docs<br/>(PDF, Wiki)"]
        IMG["Images<br/>(Diagrams, Screenshots)"]
        VID["Meeting Recordings<br/>(MP4)"]
        AUD["Customer Calls<br/>(Voice)"]
    end
    subgraph Embedding["Embedding Processing"]
        DOC --> GE2["Gemini Embedding 2<br/>API"]
        IMG --> GE2
        VID --> GE2
        AUD --> GE2
    end
    subgraph Storage["Vector Storage"]
        GE2 --> PG["pgvector /<br/>Pinecone /<br/>Weaviate"]
    end
    subgraph Retrieval["Retrieval & Generation"]
        Q["User Query"] --> QE["Query Embedding"]
        QE --> PG
        PG --> RR["Re-ranking"]
        RR --> LLM["Gemini Pro /<br/>Claude"]
        LLM --> ANS["Response"]
    end
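
The storage and retrieval stages of this design can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: `Item` and `InMemoryStore` are hypothetical names, the three-component vectors are hand-written stand-ins for real embeddings, and a production system would use one of the vector databases named in the diagram.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Item:
    item_id: str
    modality: str        # "text", "image", "video", or "audio"
    vector: np.ndarray   # unit-length embedding


@dataclass
class InMemoryStore:
    """One store for every modality, searched by cosine similarity."""
    items: list = field(default_factory=list)

    def add(self, item_id, modality, vector):
        v = np.asarray(vector, dtype=np.float32)
        self.items.append(Item(item_id, modality, v / np.linalg.norm(v)))

    def search(self, query_vector, k=3):
        q = np.asarray(query_vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scored = [(float(item.vector @ q), item) for item in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryStore()
store.add("policy.pdf", "text", [1.0, 0.0, 0.0])
store.add("diagram.png", "image", [0.9, 0.1, 0.0])
store.add("call.mp3", "audio", [0.0, 0.0, 1.0])

hits = store.search([1.0, 0.05, 0.0], k=2)
for score, item in hits:
    print(f"{item.item_id} ({item.modality}): {score:.3f}")
```

Keeping the modality as metadata next to each vector lets the retrieval layer filter or group results by type while still ranking everything in one pass.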

Code Example: Python SDK

import numpy as np
from google import genai
from google.genai import types

# Initialize client
client = genai.Client(api_key="YOUR_API_KEY")

# NOTE: this ID is kept as originally written. It matches Google's earlier
# experimental text embedding release, so confirm the current Gemini Embedding 2
# model ID in the official docs before running.
MODEL_ID = "gemini-embedding-exp-03-07"
DIMS = 768  # Reduced via MRL; both vectors must use the same size to be comparable

# Text embedding
text_result = client.models.embed_content(
    model=MODEL_ID,
    contents=["Key clauses from the internal security policy document"],
    config={
        "output_dimensionality": DIMS,
        "task_type": "RETRIEVAL_DOCUMENT"
    }
)
print(f"Text vector dimensions: {len(text_result.embeddings[0].values)}")
# Output: Text vector dimensions: 768

# Image embedding (same vector space, same dimensionality as the text vector)
image = types.Part.from_uri(
    file_uri="gs://my-bucket/architecture-diagram.png",
    mime_type="image/png"
)
image_result = client.models.embed_content(
    model=MODEL_ID,
    contents=[image],
    config={"output_dimensionality": DIMS}
)

# Cosine similarity between text and image vectors is now possible
def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity = cosine_similarity(
    text_result.embeddings[0].values,
    image_result.embeddings[0].values
)
print(f"Text-image similarity: {similarity:.4f}")

Task Type Strategy

Gemini Embedding 2 lets you specify the embedding purpose using the task_type parameter:

Task Type	Purpose	Use Case
`RETRIEVAL_DOCUMENT`	Document indexing	When storing RAG documents
`RETRIEVAL_QUERY`	Query encoding	When processing user search queries
`SEMANTIC_SIMILARITY`	Similarity comparison	Duplicate detection, clustering
`CLASSIFICATION`	Classification	Automatic document classification, labeling
`CLUSTERING`	Clustering	Topic modeling, grouping

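Pro tip: Use different task types for indexing and for retrieval. Storing documents with RETRIEVAL_DOCUMENT and encoding queries with RETRIEVAL_QUERY is the intended pairing for asymmetric retrieval, where short queries are matched against longer documents, and it generally improves retrieval quality over using a single task type for both sides.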
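
One way to keep the two task types from being mixed up is to route every embedding call through a single helper. The sketch below assumes the same `embed_content` call shape as the SDK example above. `embed_texts` is a hypothetical wrapper, and the fake client exists only so the snippet runs offline, so it returns zero vectors instead of real embeddings.

```python
from types import SimpleNamespace


def embed_texts(client, model, texts, *, for_query, dims=768):
    """Embed texts with the task type that matches their role in retrieval."""
    task_type = "RETRIEVAL_QUERY" if for_query else "RETRIEVAL_DOCUMENT"
    result = client.models.embed_content(
        model=model,
        contents=list(texts),
        config={"output_dimensionality": dims, "task_type": task_type},
    )
    return [embedding.values for embedding in result.embeddings]


# Offline stand-in for the SDK client (assumption: illustration only).
recorded_task_types = []


def _fake_embed_content(model, contents, config):
    recorded_task_types.append(config["task_type"])
    size = config["output_dimensionality"]
    return SimpleNamespace(
        embeddings=[SimpleNamespace(values=[0.0] * size) for _ in contents]
    )


fake_client = SimpleNamespace(models=SimpleNamespace(embed_content=_fake_embed_content))

doc_vectors = embed_texts(
    fake_client, "placeholder-model", ["policy text", "runbook"], for_query=False, dims=4
)
query_vectors = embed_texts(
    fake_client, "placeholder-model", ["how do I rotate keys?"], for_query=True, dims=4
)
print(recorded_task_types)
```

Making `for_query` a required keyword argument forces each call site to state which side of the retrieval it is on.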

EM/CTO Perspective: Adoption Considerations

1. Pipeline Simplification = Reduced Operational Costs

The most direct benefit of adopting multimodal embeddings is reduced pipeline complexity.

If you currently operate separate embedding pipelines per modality:

3–4 models → 1 model
2–3 vector stores → 1 vector store
Synchronization logic eliminated
Fewer systems to monitor

According to Google’s official blog, some customers have achieved a 70% latency reduction.

2. Vendor Dependency Assessment

Gemini Embedding 2 is currently Google-exclusive. For organizations running a multi-cloud strategy:

Abstract the embedding layer: Design the embedding model as a swappable interface
Vector format compatibility: 3,072-dimension vectors are compatible with most vector databases
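Leverage MRL: Dimension reduction lets you match the vector size of another model’s index schema. Note that vectors produced by different models are not comparable with each other, so switching models still means re-embedding the corpus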
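
A swappable embedding layer can be sketched as a small interface that the rest of the pipeline depends on. Everything here is illustrative: `Embedder`, `HashEmbedder`, and `index_corpus` are hypothetical names, and the hash-based embedder produces deterministic vectors with no semantic meaning. It only shows that callers never need to know which vendor sits behind the interface.

```python
import hashlib
from typing import Protocol, Sequence

import numpy as np


class Embedder(Protocol):
    """Interface the pipeline depends on, regardless of vendor."""
    dims: int

    def embed_documents(self, texts: Sequence[str]) -> np.ndarray: ...

    def embed_query(self, text: str) -> np.ndarray: ...


class HashEmbedder:
    """Deterministic toy embedder for tests; carries no semantic meaning."""

    def __init__(self, dims: int = 8):
        self.dims = dims

    def _one(self, text: str) -> np.ndarray:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        vec = np.random.default_rng(seed).standard_normal(self.dims)
        return vec / np.linalg.norm(vec)

    def embed_documents(self, texts: Sequence[str]) -> np.ndarray:
        return np.stack([self._one(t) for t in texts])

    def embed_query(self, text: str) -> np.ndarray:
        return self._one(text)


def index_corpus(embedder: Embedder, texts: Sequence[str]) -> np.ndarray:
    """Pipeline code sees only the interface, so the backend can be swapped."""
    return embedder.embed_documents(texts)


matrix = index_corpus(HashEmbedder(dims=8), ["security policy", "onboarding guide"])
print(matrix.shape)
```

A real adapter for a hosted model would implement the same two methods, which also gives a natural place to record the model name alongside each stored vector so that a later migration knows what to re-embed.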

3. Data Governance

Sending multimodal data to an external API introduces governance considerations:

On Vertex AI, VPC Service Controls can define data perimeters
CMEK (Customer-Managed Encryption Keys) is supported
PII masking is recommended before embedding meeting recordings and customer call audio
If Data Residency requirements apply, confirm region selection carefully

4. Cost Model Forecasting

The model is currently free during preview, but billing is expected after GA. Cost optimization strategy:

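Indexing:     256 dimensions (MRL)  → about 92% less raw vector storage vs. 3072 (256 / 3072 ≈ 8.3%)
First pass:   256-dim ANN search    → fast and inexpensive
Re-ranking:   3072-dim exact scoring → top 50 candidates only

This two-stage strategy balances cost and accuracy at the scale of millions of documents. Note that re-ranking needs the full 3,072-dimension vectors, so they must still be kept somewhere (for example in cheaper, non-indexed storage) or re-computed for the shortlisted candidates; the saving applies to the ANN index itself.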
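
The two-stage idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the vectors are random rather than MRL-trained embeddings, the first pass is brute force standing in for an ANN index, and `two_stage_search` and `raw_storage_saving` are hypothetical helper names.

```python
import numpy as np


def raw_storage_saving(small_dims, full_dims):
    """Fractional saving in raw vector bytes when storing fewer dimensions (same dtype)."""
    return 1.0 - small_dims / full_dims


def unit_rows(matrix):
    """Scale every row to unit length."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def two_stage_search(query, docs, coarse_dims=256, shortlist=50, k=5):
    """Return (doc indices, full-dimension cosine scores) for the top-k documents."""
    # Stage 1: cheap pass over the leading `coarse_dims` components only.
    # Brute force here; a production system would use an ANN index instead.
    coarse_docs = unit_rows(docs[:, :coarse_dims])
    coarse_query = query[:coarse_dims] / np.linalg.norm(query[:coarse_dims])
    candidates = np.argsort(-(coarse_docs @ coarse_query))[:shortlist]

    # Stage 2: exact cosine scoring of the shortlist with full-length vectors.
    full_scores = unit_rows(docs[candidates]) @ (query / np.linalg.norm(query))
    order = np.argsort(-full_scores)[:k]
    return candidates[order], full_scores[order]


rng = np.random.default_rng(7)
docs = rng.standard_normal((1000, 3072)).astype(np.float32)
# Query built as a noisy copy of document 42, so the expected top hit is known.
query = docs[42] + 0.1 * rng.standard_normal(3072).astype(np.float32)

ids, scores = two_stage_search(query, docs)
print(ids[0], round(raw_storage_saving(256, 3072), 3))
```

The shortlist size is the main tuning knob: a larger shortlist lowers the chance that the coarse pass drops a relevant document, at the cost of more full-dimension comparisons.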

Production Migration Checklist

When transitioning from a text-only RAG to a multimodal RAG:

Data inventory: Assess the current state of non-text data in your organization (images, video, audio)
Prioritization: Apply multimodal indexing first to document types with the highest search failure rate
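Vector DB compatibility: Confirm your existing vector store supports 3,072 dimensions. pgvector, Pinecone, and Weaviate can all store vectors of that size, but pgvector’s HNSW and IVFFlat indexes on the vector type are limited to 2,000 dimensions, so indexing there requires halfvec or an MRL-reduced size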
A/B testing: Quantitatively compare retrieval accuracy between existing text-only and multimodal embeddings
Monitoring: Track cross-modal search rate, latency, and embedding API call volume
Security review: Obtain security and compliance approval for external transmission of multimodal data
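
For the A/B testing item, one common way to compare retrieval quality is recall@k over a labeled query set. The sketch below is a minimal version of that metric. The document ids and relevance labels are made up purely to show the calculation and are not measurements of any model.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear among the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Made-up results for two queries under each pipeline (illustration only).
text_only_runs = [
    (["d1", "d7", "d3"], {"d3", "img9"}),   # the relevant image is unreachable
    (["d2", "d4", "d5"], {"d4"}),
]
multimodal_runs = [
    (["img9", "d3", "d1"], {"d3", "img9"}),
    (["d4", "d2", "d5"], {"d4"}),
]

print(mean_recall_at_k(text_only_runs, k=3), mean_recall_at_k(multimodal_runs, k=3))
```

Labeling relevant items across all modalities, not just text, matters here: a text-only baseline cannot retrieve an image at all, and the metric should reflect that gap.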

Conclusion

Gemini Embedding 2 is not simply “a new embedding model.” It is a paradigm shift in RAG pipeline architecture.

Search systems that could previously only handle text can now perform unified retrieval across images, video, and audio within the same vector space. This is not just a technical advancement — it is a change that can fundamentally transform how organizations leverage unstructured data.

Key action items from an Engineering Manager’s perspective:

Now: Run a PoC using the Gemini API during the preview period (free)
Within 1–2 weeks: Create an inventory of non-text data within your organization
Within 1 month: Design an A/B test comparing multimodal RAG against your existing RAG pipeline

About the Author

Kim Jangwook
