ColNomic Embed Multimodal 7B: an open-source, multilingual model for efficient visual document retrieval

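Developed by nomic-ai

ColNomic Embed Multimodal 7B is a state-of-the-art multi-vector multimodal embedding model. It is built for visual document retrieval, supports multiple languages, and encodes text and images in a unified way.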

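Multimodal fusion

Safetensors

Multilingual | License: Apache-2.0 | #multimodal-document-retrieval #multilingual-visual-embedding #unified-text-image-encoding

Downloads: 7,909

Released: March 31, 2025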

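Model overview

A 7-billion-parameter multimodal embedding model designed for visual document retrieval. It encodes interleaved text and images directly, with no complex preprocessing.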

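Model features

High performance

Reaches 62.7 NDCG@5 on Vidore-v2, the highest average among the models in the comparison table

Unified text-image encoding

Encodes interleaved text and images directly, with no complex preprocessing

Advanced architecture

A 7-billion-parameter multimodal embedding model

Fully open source

Model weights, training data, and code are all available

Multilingual support

Supports English, Italian, French, German, and Spanish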

Model capabilities

Visual document retrieval

Multimodal embedding

Multilingual embedding

Text-to-visual-document retrieval

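Use cases

Research papers

Captures formulas, charts, and tables

Retrieves academic papers that contain complex scientific formulas and figures

Improves retrieval accuracy

Technical documentation

Encodes code blocks, flowcharts, and screenshots

Retrieves code examples and system architecture diagrams from technical documentation

More precise retrieval of technical content

Product catalogs

Product image retrieval

Retrieves relevant product images from a product description

Improves the e-commerce experience

Financial reports

Embeds charts, graphs, and numerical data

Retrieves key data visualizations from financial reports

Quickly locates key financial metrics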

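🚀 ColNomic Embed Multimodal 7B: a leading visual document retrieval model

colnomic-embed-multimodal-7b is a state-of-the-art multi-vector multimodal embedding model that excels at visual document retrieval:

High performance: reaches 62.7 NDCG@5 on Vidore-v2, the highest average among the models in the comparison table.
Unified text-image encoding: encodes interleaved text and images directly, with no complex preprocessing.
Advanced architecture: a 7-billion-parameter multimodal embedding model.
Fully open source: model weights, training data, and code are all publicly available.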

🚀 Quick start

To use colnomic-embed-multimodal-7b, install colpali from source:

pip install git+https://github.com/illuin-tech/colpali.git

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model_name = "nomic-ai/colnomic-embed-multimodal-7b"

model = ColQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2_5_Processor.from_pretrained(model_name)

# Inputs: placeholder images and example queries
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Preprocess the inputs and move them to the model's device
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass (no gradients needed for inference)
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
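
The last line of the snippet scores every query against every image. In ColPali-style models this is usually a late-interaction (MaxSim) score: each query token vector is matched to its most similar document vector, and those maxima are summed. The following is a minimal pure-Python sketch of that idea on toy 2-dimensional vectors. The helper names (`dot`, `maxsim_score`, `score_matrix`) and the toy values are illustrative assumptions, not part of the colpali API, and the library's tensor implementation may differ in details such as batching and padding.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """For each query vector take its best match in the document, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def score_matrix(
    queries: Sequence[Sequence[Vector]], docs: Sequence[Sequence[Vector]]
) -> List[List[float]]:
    """Rows are queries, columns are documents."""
    return [[maxsim_score(q, d) for d in docs] for q in queries]


# Toy multi-vector embeddings (hypothetical values, 2 dimensions each).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5]]

toy_scores = score_matrix([query], [page_a, page_b])
print(toy_scores)  # [[2.0, 1.0]]
```

Here page_a scores higher because it contains an exact match for each query vector, while page_b only offers a partial match for both.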

✨ Key features

Strong performance

Model	Avg.	ESG Restaurant Human	Econ Macro Multi.	AXA Multi.	MIT Bio	ESG Restaurant Synth.	ESG Restaurant Synth. Multi.	MIT Bio Multi.	AXA	Econ Macro
ColNomic Embed Multimodal 7B	62.7	73.9	54.7	61.3	66.1	57.3	56.7	64.2	68.3	61.6
ColNomic Embed Multimodal 3B	61.2	65.8	55.4	61.0	63.5	56.6	57.2	62.5	68.8	60.2
T-Systems ColQwen2.5-3B	59.9	72.1	51.2	60.0	65.3	51.7	53.3	61.7	69.3	54.8
Nomic Embed Multimodal 7B	59.7	65.7	57.7	59.3	64.0	49.2	51.9	61.2	66.3	63.1
GME Qwen2 7B	59.0	65.8	56.2	55.4	64.0	54.3	56.7	55.1	60.7	62.9
Nomic Embed Multimodal 3B	58.8	59.8	57.5	58.8	62.5	49.4	49.4	58.6	69.6	63.5
Llama Index vdr-2b-multi-v1	58.4	63.1	52.8	61.0	60.6	50.3	51.2	56.9	68.8	61.2
Voyage Multimodal 3	55.0	56.1	55.0	59.5	56.4	47.2	46.2	51.5	64.1	58.8
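
The table reports NDCG@5, apparently on a 0 to 100 scale. As a reminder of what the metric measures, here is a minimal sketch of NDCG@k with linear gain. It assumes `relevances` holds the relevance labels of all candidate documents for one query, in the order the model ranked them. The function names are illustrative, and the benchmark's own evaluation code may use a different gain or tie-handling convention.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Two relevant pages, ranked 2nd and 5th out of six candidates.
example_ranking = [0, 1, 0, 0, 1, 0]
print(round(100 * ndcg_at_k(example_ranking, k=5), 1))
```

A ranking that places every relevant page first scores 1.0 (100 on the table's scale), and pushing relevant pages lower in the top 5 reduces the score logarithmically.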

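Architecture

Total parameters: 7 billion
Training approach: fine-tuned from Qwen2.5-VL 7B Instruct
Architecture type: vision-language model with unified text and image input processing
Key innovations:
- Same-source sampling, which creates harder in-batch negatives.
- A multi-vector output option, which improves performance.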

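Integration with RAG workflows

Nomic Embed Multimodal 7B integrates with retrieval-augmented generation (RAG) workflows:

Direct document embedding: embed document page images directly, skipping OCR and complex processing.
Faster processing: removing preprocessing steps makes indexing faster.
More complete information: text and visual cues are captured in a single embedding.
Simple implementation: the same API is used for text and images.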
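
In a RAG pipeline, the retrieval step typically reduces the query-to-page scores to a short list of pages that are then handed to a generator. The sketch below shows only that selection step on plain Python lists. The page identifiers and score values are hypothetical, and `top_k_pages` and `build_context` are illustrative helpers rather than functions from colpali or Nomic. With the quick-start code, one row of the score tensor could be converted with `scores[i].tolist()` before calling them.

```python
from typing import List, Sequence, Tuple


def top_k_pages(scores_row: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k highest-scoring pages."""
    ranked = sorted(enumerate(scores_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def build_context(page_ids: Sequence[str], hits: Sequence[Tuple[int, float]]) -> List[str]:
    """Map hit indices back to page identifiers for the generation step."""
    return [page_ids[index] for index, _ in hits]


# Hypothetical index of three page images and one query's scores against them.
page_ids = ["report.pdf#1", "report.pdf#2", "report.pdf#3"]
example_scores = [11.2, 17.9, 9.4]

hits = top_k_pages(example_scores, k=2)
print(build_context(page_ids, hits))  # ['report.pdf#2', 'report.pdf#1']
```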

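Training details

ColNomic Embed Multimodal 7B was developed with the following key innovations:

Same-source sampling: forcing each batch to be sampled from a single dataset source creates harder in-batch negatives and keeps the model from learning dataset artifacts.
Multi-vector configuration: a multi-vector variant is provided, and it outperforms the dense variant (in the benchmark table, ColNomic Embed Multimodal 7B averages 62.7 against 59.7 for Nomic Embed Multimodal 7B).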
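
The same-source sampling idea can be sketched as a batch builder that never mixes dataset sources within a batch, so every in-batch negative comes from the same source as the positive. This is a minimal illustration under stated assumptions, not the training code used for the model: the `(source_name, example_id)` record layout, the source names, and the choice to drop partial batches are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (source_name, example_id); both are illustrative


def same_source_batches(
    examples: Sequence[Example], batch_size: int, seed: int = 0
) -> List[List[Example]]:
    """Group examples by source, then cut each group into full batches."""
    rng = random.Random(seed)
    by_source: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_source[example[0]].append(example)

    batches: List[List[Example]] = []
    for group in by_source.values():
        rng.shuffle(group)
        # Drop the trailing partial batch so every batch has the same size.
        for start in range(0, len(group) - batch_size + 1, batch_size):
            batches.append(group[start:start + batch_size])

    rng.shuffle(batches)  # interleave sources across training steps
    return batches


demo = [("arxiv", f"a{i}") for i in range(5)] + [("finance", f"f{i}") for i in range(4)]
demo_batches = same_source_batches(demo, batch_size=2)
for batch in demo_batches:
    print(batch)
```

Because each batch is drawn from one source, the model cannot separate positives from negatives using source-level cues such as page style, which is the dataset-artifact shortcut the text describes.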

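Limitations

Performance may vary on documents with unconventional layouts or unusual visual elements.
Multiple languages are supported, but performance is strongest on English content.
Very large or complex documents may need to be split into smaller chunks.
Performance may degrade on documents that contain handwriting or heavily stylized fonts.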
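
One simple reading of the chunking advice is to process a long document's pages in fixed-size groups rather than all at once. The helper below is a generic sketch of that pattern. `chunked` and the group size of 4 are illustrative assumptions, and the right chunk size depends on available memory and page resolution.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Ten hypothetical page numbers split into groups of four.
page_groups = list(chunked(list(range(10)), size=4))
print(page_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each group could then be passed to `processor.process_images` and the model in turn, with the resulting embeddings collected into one index.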

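📚 Further documentation

Join the Nomic community

Nomic Embed ecosystem: https://www.nomic.ai/embed
Website: https://nomic.ai
Twitter: https://twitter.com/nomic_ai
Discord: https://discord.gg/myY5YDR8z8

Citation

If you find this model useful in your research or applications, please consider citing: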

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
      title={Unifying Multimodal Retrieval via Document Screenshot Embedding}, 
      author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
      year={2024},
      eprint={2406.11251},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2406.11251}, 
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}

📄 License

This project is released under the Apache-2.0 license.

📦 Model information

Property	Details
Base model	Qwen/Qwen2.5-VL-7B-Instruct
Library	peft
Training datasets	llamaindex/vdr-multilingual-train, nomic-ai/colpali_train_set_split_by_source
Languages	English, Italian, French, German, Spanish
Task	Visual document retrieval
Tags	vidore, colpali, multimodal_embedding, multilingual_embedding, Text-to-Visual Document (T→VD) retrieval
License	apache-2.0