🚀 ColNomic Embed Multimodal 3B: State-of-the-Art Visual Document Retrieval
`colnomic-embed-multimodal-3b` is a state-of-the-art multi-vector multimodal embedding model for visual document retrieval. It tackles the problem of retrieving visual documents efficiently by providing a high-performance, unified encoding of text and images.
Model Information
| Property | Details |
| --- | --- |
| Base Model | Qwen/Qwen2.5-VL-3B-Instruct |
| Library Name | peft |
| Training Datasets | llamaindex/vdr-multilingual-train, nomic-ai/colpali_train_set_split_by_source |
| Supported Languages | en, it, fr, de, es |
| Pipeline Tag | visual-document-retrieval |
| Tags | vidore, colpali, multimodal_embedding, multilingual_embedding, Text-to-Visual Document (T→VD) retrieval |
✨ Features
- High Performance: Achieves 61.2 NDCG@5 on Vidore-v2, outperforming all other models except ColNomic Embed Multimodal 7B.
- Unified Text-Image Encoding: Directly encodes interleaved text and images without complex preprocessing.
- Advanced Architecture: A 3B-parameter multimodal embedding model.
- Open Weights: Model weights are available for research use.
📦 Installation
To use `colnomic-embed-multimodal-3b`, install `colpali` from source:

```bash
pip install git+https://github.com/illuin-tech/colpali.git
```
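The usage example below enables Flash Attention 2 when it is available. If your GPU supports it, you can optionally install the `flash-attn` package as well; this is not required, since the code falls back to the default attention implementation:

```bash
pip install flash-attn --no-build-isolation
```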
💻 Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-3b"

# Load the model in bfloat16, using Flash Attention 2 when available
model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Placeholder inputs: swap in your own document page images and queries
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Preprocess inputs and move them to the model's device
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Compute multi-vector embeddings
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction similarity scores
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
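The resulting `scores` tensor has shape `(num_queries, num_images)`, with higher values meaning greater similarity. Continuing the example above, a quick way to inspect the ranking:

```python
# For each query, report the index and score of the best-matching image
for i, query in enumerate(queries):
    best = scores[i].argmax().item()
    print(f"Query {i}: best image = {best}, score = {scores[i, best]:.2f}")
```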
📚 Documentation
Performance
All scores are NDCG@5 on the Vidore-v2 benchmark.

| Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ColNomic Embed Multimodal 7B | 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 |
| ColNomic Embed Multimodal 3B | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 |
| T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 |
| Nomic Embed Multimodal 7B | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 |
| GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 |
| Nomic Embed Multimodal 3B | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 |
| Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 |
| Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |
Model Architecture
- Total Parameters: 3B
- Training Approach: Fine-tuned from Qwen2.5-VL 3B Instruct
- Architecture Type: Vision-language model with unified text and image input processing
- Key Innovations:
  - Same-source sampling to create harder in-batch negatives
  - Multi-vector output option for enhanced performance (see the scoring sketch below)
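The multi-vector embeddings are compared with ColBERT-style late interaction (MaxSim), the kind of scoring `processor.score_multi_vector` performs. A minimal sketch of the scoring rule for a single query-document pair, for illustration only (not the library's implementation):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query/document pair.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    Each query token is matched to its most similar document token,
    and the per-token maxima are summed.
    """
    sim = query_emb @ doc_emb.T             # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()      # max over doc tokens, sum over query tokens
```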
Integration with RAG Workflows
ColNomic Embed Multimodal 3B integrates seamlessly with Retrieval-Augmented Generation (RAG) workflows (a minimal indexing sketch follows the list below):
- Direct Document Embedding: Skip OCR and complex processing by directly embedding document page images.
- Faster Processing: Eliminate preprocessing steps for quicker indexing.
- More Complete Information: Capture both textual and visual cues in a single embedding.
- Simple Implementation: Use the same API for both text and images.
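A minimal sketch of indexing document pages for RAG, reusing the `model` and `processor` from the usage example above. The function names are illustrative, and batching details, persistence, and error handling are left out:

```python
import torch

def embed_pages(page_images, model, processor, batch_size=4):
    """Embed document page images; returns one multi-vector embedding per page."""
    embeddings = []
    for i in range(0, len(page_images), batch_size):
        batch = processor.process_images(page_images[i:i + batch_size]).to(model.device)
        with torch.no_grad():
            embeddings.extend(list(model(**batch)))
    return embeddings

def retrieve(query, page_embeddings, model, processor, top_k=3):
    """Rank indexed pages against a text query; returns the top-k page indices."""
    batch_query = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        query_embedding = model(**batch_query)
    scores = processor.score_multi_vector(query_embedding, page_embeddings)  # (1, num_pages)
    return scores[0].topk(min(top_k, len(page_embeddings))).indices.tolist()
```

The retrieved page images can then be passed directly to a vision-language model for answer generation, with no OCR step in between.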
Recommended Use Cases
The model excels at real-world document retrieval scenarios that challenge traditional text-only systems:
- Research Papers: Capture equations, diagrams, and tables.
- Technical Documentation: Encode code blocks, flowcharts, and screenshots.
- Product Catalogs: Represent images, specifications, and pricing tables.
- Financial Reports: Embed charts, graphs, and numerical data.
- Visually Rich Content: Where layout and visual information are important.
- Multilingual Documents: Where visual context provides important cues.
Training Details
ColNomic Embed Multimodal 3B was developed through several key innovations:
- Sampling From the Same Source: Forcing every example in a batch to come from the same dataset source creates harder in-batch negatives, preventing the model from learning dataset artifacts (a sampler sketch follows this list).
- Multi-Vector Configuration: A multi-vector variant that achieves higher performance than the dense variant.
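A minimal sketch of same-source batch sampling, assuming each training example carries a `"source"` tag identifying its dataset. This is illustrative only, not the actual training code:

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size):
    """Yield batches in which every example comes from the same dataset source,
    so in-batch negatives are harder and source-specific artifacts cannot be
    used to separate positives from negatives."""
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)  # assumes a "source" field per example
    for group in by_source.values():
        random.shuffle(group)
        for i in range(0, len(group) - batch_size + 1, batch_size):
            yield group[i:i + batch_size]
```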
Limitations
- Performance may vary when processing documents with unconventional layouts or unusual visual elements.
- While it handles multiple languages, performance is strongest on English content.
- Processing very large or complex documents may require dividing them into smaller chunks.
- Performance on documents with handwriting or heavily stylized fonts may be reduced.
🔧 Technical Details
The model's development rests on two key techniques. First, same-source sampling creates more challenging in-batch negatives, which keeps the model from learning dataset-specific artifacts. Second, the multi-vector configuration provides a variant that outperforms the dense variant, improving overall retrieval quality.
📄 License
No license information is provided for this model.
📖 Citation
If you find this model useful in your research or applications, please consider citing:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}

@misc{ma2024unifyingmultimodalretrievaldocument,
  title={Unifying Multimodal Retrieval via Document Screenshot Embedding},
  author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
  year={2024},
  eprint={2406.11251},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2406.11251},
}

@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}
```