ColNomic Embed Multimodal 7B Open-Source Model - Supports Multiple Languages and Enables Efficient Visual Document Retrieval


Developed by nomic-ai

ColNomic Embed Multimodal 7B is a state-of-the-art multi-vector multimodal embedding model for visual document retrieval, with multilingual support and unified text-image encoding.

Multimodal Fusion

Safetensors

Supports Multiple Languages

Open Source License: Apache-2.0

#Multimodal Document Retrieval #Multilingual Visual Embedding #Unified Text-Image Encoding

Downloads 7,909

Release Date: 3/31/2025

Model Overview

This 7-billion-parameter multimodal embedding model is specifically designed for visual document retrieval tasks, capable of directly encoding interleaved text and images without complex preprocessing.

Model Features

High Performance

Achieves 62.7 NDCG@5 on Vidore-v2, the highest average score among the models listed in the Performance table

Unified Text-Image Encoding

Directly encodes interleaved text and images without complex preprocessing

Advanced Architecture

7-billion-parameter multimodal embedding model

Fully Open Source

Provides model weights, training data, and code

Multilingual Support

Supports English, Italian, French, German, and Spanish

Model Capabilities

Visual Document Retrieval

Multimodal Embedding

Multilingual Embedding

Text-to-Visual Document Retrieval

Use Cases

Research Papers

Capturing Formulas, Charts, and Tables

Used for retrieving academic papers containing complex scientific formulas and charts

Improved retrieval accuracy

Technical Documentation

Encoding Code Blocks, Flowcharts, and Screenshots

Used for retrieving code examples and system architecture diagrams in technical documents

More precise technical content retrieval

Product Catalogs

Product Image Retrieval

Retrieve relevant product images based on product descriptions

Enhanced e-commerce experience

Financial Reports

Embedding Charts, Graphs, and Numerical Data

Used for retrieving key data visualizations in financial reports

Quickly locate key financial metrics

🚀 ColNomic Embed Multimodal 7B: State-of-the-Art Visual Document Retrieval

colnomic-embed-multimodal-7b is a state-of-the-art multi-vector multimodal embedding model designed for visual document retrieval. It offers high performance, unified text-image encoding, and an advanced architecture, and it is fully open-source.

✨ Features

High Performance: Achieves 62.7 NDCG@5 on Vidore-v2, the highest average score among the models listed in the Performance table.
Unified Text-Image Encoding: Directly encodes interleaved text and images without complex preprocessing.
Advanced Architecture: A 7B parameter multimodal embedding model.
Fully Open-Source: Model weights, training data, and code are all available.
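
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how NDCG@5 can be computed for a single query. It assumes linear gain and a log2 rank discount, which is a common formulation; the exact variant used by the Vidore-v2 benchmark is not stated here, and the function names are illustrative only.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query.

    ranked_relevances holds the relevance label of every candidate page,
    listed in the order the retriever ranked them.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# The only relevant page was ranked second instead of first.
example = ndcg_at_k([0, 1, 0, 0, 0], k=5)
```

A benchmark score such as 62.7 would then be this value averaged over all queries and expressed on a 0 to 100 scale.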

📦 Installation

To use colnomic-embed-multimodal-7b, you need to install colpali from source:

pip install git+https://github.com/illuin-tech/colpali.git

💻 Usage Examples

Basic Usage

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-7b"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
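
The scoring call above produces one score per query and image pair. As a rough illustration of what a multi-vector score means, the sketch below implements ColBERT-style late interaction (MaxSim), the scheme described in the ColPali paper cited on this page, in plain Python. It assumes that `score_multi_vector` follows the same scheme; the helpers `maxsim_score` and `score_matrix` are hypothetical names and are not part of colpali_engine.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take its best-matching
    page vector, then sum those maxima over the query vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def score_matrix(queries, pages):
    """Rows are queries, columns are pages."""
    return [[maxsim_score(q, p) for p in pages] for q in queries]

# Toy data: one query with two token vectors, two pages with patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query vectors
page_b = [[1.0, 0.0]]                           # matches only the first
toy_scores = score_matrix([query], [page_a, page_b])
```

In this toy case `page_a` scores higher than `page_b` because every query vector finds a close match among its patch vectors.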

📚 Documentation

Model Architecture

Property	Details
Total Parameters	7B
Training Approach	Fine-tuned from Qwen2.5-VL 7B Instruct
Architecture Type	Vision-Language Model with unified text and image input processing
Key Innovations	1. Same-source sampling to create harder in-batch negatives 2. Multi-vector output option for enhanced performance

Integration with RAG Workflows

ColNomic Embed Multimodal 7B integrates with Retrieval Augmented Generation (RAG) workflows:

Direct Document Embedding: Skip OCR and complex processing by directly embedding document page images.
Faster Processing: Eliminate preprocessing steps for quicker indexing.
More Complete Information: Capture both textual and visual cues in a single embedding.
Simple Implementation: Use the same API for both text and images.
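
In a RAG pipeline, the retrieval step usually reduces the query-to-page scores to a short list of pages that are handed to the generator. The sketch below shows one way to do that in plain Python. The function `top_k_pages` and the page identifiers are hypothetical; it assumes you already have one row of scores for a query, for example a row of the `scores` tensor from the usage example converted with `.tolist()`.

```python
import heapq

def top_k_pages(scores_for_query, page_ids, k=3):
    """Return the k highest-scoring pages as (page_id, score) pairs, best first."""
    if len(scores_for_query) != len(page_ids):
        raise ValueError("need exactly one score per page")
    ranked = heapq.nlargest(k, zip(scores_for_query, page_ids), key=lambda pair: pair[0])
    return [(page_id, score) for score, page_id in ranked]

# Hypothetical scores for one query against three indexed page images.
hits = top_k_pages([0.2, 0.9, 0.5], ["report_p1", "report_p2", "report_p3"], k=2)
```

The selected page images (or text extracted from them) can then be passed to whatever generator model the workflow uses.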

Recommended Use Cases

The model is well-suited for real-world document retrieval scenarios that challenge traditional text-only systems:

Research Papers: Capture equations, diagrams, and tables.
Technical Documentation: Encode code blocks, flowcharts, and screenshots.
Product Catalogs: Represent images, specifications, and pricing tables.
Financial Reports: Embed charts, graphs, and numerical data.
Visually Rich Content: Where layout and visual information are important.
Multilingual Documents: Where visual context provides important cues.

Training Details

ColNomic Embed Multimodal 7B was developed with several key innovations:

Sampling From the Same Source: Forcing sampling from the same dataset source creates harder in-batch negatives, preventing the model from learning dataset artifacts.
Multi-Vector Configuration: Providing a multi-vector variant that achieves higher performance than the dense variant.
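
To make the same-source sampling idea concrete, here is a minimal sketch of a batch builder in which every batch draws from a single dataset source, so the in-batch negatives share that source's style. This is an illustration under stated assumptions, not Nomic's training code: the `source` field, the function name, and the choice to drop incomplete batches are all hypothetical.

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size, seed=0):
    """Group examples by their 'source' field and cut each group into batches,
    so no batch mixes examples from different sources."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)

    batches = []
    for source_examples in by_source.values():
        rng.shuffle(source_examples)
        for start in range(0, len(source_examples), batch_size):
            batch = source_examples[start:start + batch_size]
            if len(batch) == batch_size:  # drop the incomplete tail batch
                batches.append(batch)
    rng.shuffle(batches)  # interleave sources across training steps
    return batches

examples = (
    [{"id": i, "source": "a"} for i in range(4)]
    + [{"id": i, "source": "b"} for i in range(4, 9)]
)
batches = same_source_batches(examples, batch_size=2)
```

With a contrastive loss, the other items in such a batch serve as negatives, and because they come from the same source the model cannot separate them using source-specific artifacts alone.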

Limitations

Performance may vary when processing documents with unconventional layouts or unusual visual elements.
While it can handle multiple languages, performance is strongest on English content.
Processing very large or complex documents may require dividing them into smaller chunks.
Performance on documents with handwriting or heavily stylized fonts may be reduced.
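
For the large-document case, one simple approach is to embed page images in fixed-size batches rather than all at once. The helper below is a minimal sketch; `page_batches` is a hypothetical name, and a suitable batch size depends on available memory.

```python
def page_batches(pages, batch_size):
    """Yield consecutive slices of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        yield pages[start:start + batch_size]

# With the objects from the usage example, each batch could be passed through
# processor.process_images(...) and the model in turn, collecting embeddings.
chunks = list(page_batches(["p1", "p2", "p3", "p4", "p5"], batch_size=2))
```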

🔧 Technical Details

The model has the following technical details:

Base Model: Qwen/Qwen2.5-VL-7B-Instruct
Library Name: peft
Datasets:
- llamaindex/vdr-multilingual-train
- nomic-ai/colpali_train_set_split_by_source
Language: en, it, fr, de, es
Pipeline Tag: visual-document-retrieval
Tags: vidore, colpali, multimodal_embedding, multilingual_embedding, Text-to-Visual Document (T→VD) retrieval

📄 License

This project is licensed under the Apache-2.0 license.

Performance

Scores are NDCG@5 on Vidore-v2; higher is better.

Model	Avg.	ESG Restaurant Human	Econ. Macro Multi.	AXA Multi.	MIT Bio	ESG Restaurant Synth.	ESG Restaurant Synth. Multi.	MIT Bio Multi.	AXA	Econ. Macro
ColNomic Embed Multimodal 7B	62.7	73.9	54.7	61.3	66.1	57.3	56.7	64.2	68.3	61.6
ColNomic Embed Multimodal 3B	61.2	65.8	55.4	61.0	63.5	56.6	57.2	62.5	68.8	60.2
T-Systems ColQwen2.5-3B	59.9	72.1	51.2	60.0	65.3	51.7	53.3	61.7	69.3	54.8
Nomic Embed Multimodal 7B	59.7	65.7	57.7	59.3	64.0	49.2	51.9	61.2	66.3	63.1
GME Qwen2 7B	59.0	65.8	56.2	55.4	64.0	54.3	56.7	55.1	60.7	62.9
Nomic Embed Multimodal 3B	58.8	59.8	57.5	58.8	62.5	49.4	49.4	58.6	69.6	63.5
Llama Index vdr-2b-multi-v1	58.4	63.1	52.8	61.0	60.6	50.3	51.2	56.9	68.8	61.2
Voyage Multimodal 3	55.0	56.1	55.0	59.5	56.4	47.2	46.2	51.5	64.1	58.8
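
The Avg. column appears to be the unweighted mean of the nine per-dataset scores. A quick check in Python, using the two ColNomic rows copied from the table above (the variable and function names are illustrative):

```python
# Per-dataset scores in table column order, copied from the rows above.
per_dataset = {
    "ColNomic Embed Multimodal 7B": [73.9, 54.7, 61.3, 66.1, 57.3, 56.7, 64.2, 68.3, 61.6],
    "ColNomic Embed Multimodal 3B": [65.8, 55.4, 61.0, 63.5, 56.6, 57.2, 62.5, 68.8, 60.2],
}

def mean_score(values):
    """Unweighted mean, rounded to one decimal place like the table."""
    return round(sum(values) / len(values), 1)

averages = {model: mean_score(values) for model, values in per_dataset.items()}
```

Both computed values match the reported Avg. entries of 62.7 and 61.2.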

Join the Nomic Community

Nomic Embed Ecosystem: https://www.nomic.ai/embed
Website: https://nomic.ai
Twitter: https://twitter.com/nomic_ai
Discord: https://discord.gg/myY5YDR8z8

Citation

If you find this model useful in your research or applications, please consider citing:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
      title={Unifying Multimodal Retrieval via Document Screenshot Embedding}, 
      author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
      year={2024},
      eprint={2406.11251},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2406.11251}, 
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}
