
Google releases its first native multimodal embedding model, Gemini Embedding 2

Google DeepMind launched its first native multimodal embedding model, Gemini Embedding 2, on March 10. The model unifies text, images, video, audio, and documents in a single embedding space, supports more than 100 languages, and introduces native audio embedding for the first time, eliminating the intermediate step of converting speech to text. It uses Matryoshka Representation Learning (MRL) to flexibly compress vector dimensions, balancing performance against storage costs.
On March 10, Google DeepMind launched Gemini Embedding 2, the company's first native multimodal embedding model, which unifies text, images, videos, audio, and documents in a single embedding space, marking a step toward full-modal integration in AI embedding technology.

Gemini Embedding 2 supports semantic understanding in over 100 languages and surpasses existing mainstream models in benchmark tests for text, image, and video tasks, while also introducing speech processing capabilities that were previously lacking in embedding models.
The model is now in public preview through the Gemini API and Vertex AI, allowing developers to access it immediately.
For enterprise users, the release directly lowers the technical barrier to building multimodal retrieval-augmented generation (RAG), semantic search, and data classification systems, promising to simplify data pipelines that previously required separate handling for each modality.
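Once every modality lands in the same embedding space, retrieval reduces to nearest-neighbor search over one vector store. The sketch below uses toy 4-dimensional vectors for readability (the model's default output is 3072 dimensions); the function names and corpus layout are illustrative, not part of the Gemini API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=3):
    """Return the k document IDs whose embeddings are most similar to the query.

    `corpus` maps document IDs to embedding vectors; because all modalities
    share one embedding space, a text query can match image or video entries.
    """
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

In a real deployment the toy dictionary would be replaced by a vector database, but the ranking logic stays the same.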
Full-modal Unification: Expanding from Text to Five Media Forms
Gemini Embedding 2 is built on the Gemini architecture, extending embedding capabilities from pure text to five types of input forms:
- Text supports up to 8192 input tokens;
- Images can process up to 6 images per request, supporting PNG and JPEG formats;
- Videos support MP4 and MOV files up to 120 seconds long;
- Audio can be ingested directly to generate embedding vectors, with no intermediate text-transcription step;
- Documents support direct embedding of PDF files up to 6 pages.
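The per-modality limits above can be checked before a request is sent. The limit values below come from the article; the validator function itself is a hypothetical helper, not part of any Google SDK.

```python
# Per-modality limits as published for Gemini Embedding 2.
MODALITY_LIMITS = {
    "text_tokens": 8192,    # max input tokens per text
    "images": 6,            # max images per request (PNG/JPEG)
    "video_seconds": 120,   # max video length (MP4/MOV)
    "pdf_pages": 6,         # max pages per embedded PDF
}

def check_request(text_tokens=0, images=0, video_seconds=0, pdf_pages=0):
    """Return a list of limit violations for a planned embedding request."""
    errors = []
    if text_tokens > MODALITY_LIMITS["text_tokens"]:
        errors.append("text exceeds 8192 tokens")
    if images > MODALITY_LIMITS["images"]:
        errors.append("more than 6 images per request")
    if video_seconds > MODALITY_LIMITS["video_seconds"]:
        errors.append("video longer than 120 seconds")
    if pdf_pages > MODALITY_LIMITS["pdf_pages"]:
        errors.append("PDF longer than 6 pages")
    return errors
```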
Unlike the traditional method of processing single modalities one by one, this model supports interleaved input, allowing multiple modality combinations such as images and text to be passed in a single request, enabling the model to capture complex and subtle semantic relationships between different media types.
Gemini Embedding 2 continues the use of Matryoshka Representation Learning (MRL) technology from Google's previous embedding models. This technology dynamically compresses vector dimensions through a "nested" approach, allowing the output dimensions to be flexibly reduced from the default 3072, helping developers strike a balance between model performance and storage costs.
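The idea behind MRL can be sketched in a few lines: because Matryoshka-trained models concentrate the most important information in the leading dimensions, a prefix of the full vector remains a usable embedding. This is an illustrative sketch, not the SDK's API; re-normalizing after truncation is a common convention but the exact recommended procedure may differ.

```python
import math

def truncate_embedding(vec, dim):
    """MRL-style downsizing: keep the first `dim` components, re-normalize.

    Works because Matryoshka Representation Learning packs the most
    semantically important information into the leading dimensions.
    """
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]
```

A developer could, for example, store the full 3072-dimensional output and truncate to 768 dimensions at index time, cutting storage fourfold.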
Leading Benchmark Tests, Speech Capability as a New Highlight
Google stated that Gemini Embedding 2 outperforms current mainstream competitors in benchmark tests for text, image, and video tasks, positioning it as a new performance benchmark in multimodal embedding.

Google recommends that developers choose among three dimensions, 3072, 1536, or 768, based on the application scenario, to achieve the best embedding quality. This design is particularly important for enterprises deploying embedding vectors at scale, as it effectively controls infrastructure costs without significantly sacrificing accuracy.
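The cost trade-off among the three dimensions is straightforward arithmetic. Assuming float32 storage (4 bytes per component, a common but not stated choice) and a hypothetical corpus of 10 million vectors:

```python
def storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for embeddings stored as float32 (4 bytes per component)."""
    return num_vectors * dim * bytes_per_value / 1024**3

# Hypothetical corpus of 10 million embeddings at each supported dimension.
for dim in (3072, 1536, 768):
    print(f"{dim:>4} dims -> {storage_gb(10_000_000, dim):.1f} GB")
```

Dropping from 3072 to 768 dimensions cuts raw vector storage by exactly 4x, which is the lever MRL gives developers for balancing accuracy against infrastructure cost.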
In terms of capability coverage, this model introduces native voice embedding capabilities that were generally lacking in previous similar models, allowing direct processing of audio data without the need for an intermediate step of converting speech to text.
Google points out that embedding technology has been widely applied in several of its products, covering context engineering in RAG scenarios, large-scale data management, as well as traditional search and analysis scenarios.
Some early-access partners have already begun building multimodal applications on Gemini Embedding 2, and Google says these use cases are realizing the model's potential in high-value scenarios.
