nomic-embed-multimodal-7b: Open-Source Multimodal Embedding Model for Efficient Visual Document Retrieval

Nomic Embed Multimodal 7b

Developed by nomic-ai

A 7-billion-parameter multimodal embedding model specialized for visual document retrieval, with strong results on the Vidore-v2 benchmark

Visual Document Retrieval

Safetensors

Supports Multiple Languages

Open Source License: Apache-2.0

#UnifiedTextImageEncoding #VisualDocumentRetrieval #MultilingualEmbedding

Downloads 741

Release Time: 3/29/2025

Model Overview

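A high-performing dense multimodal embedding model that directly processes interleaved text and images without complex preprocessing, making it well suited to visual document retrieval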

Model Features

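High Performance

Achieves 58.8 NDCG@5 on the Vidore-v2 benchmark, outperforming all other dense multimodal embedding models

Unified Text-Image Encoding

Directly encodes interleaved text and images without complex preprocessing

Advanced Architecture

A 7-billion-parameter multimodal embedding model

Fully Open Source

Model weights, training data, and full code are available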

Model Capabilities

Visual document retrieval

Multimodal embedding

Multilingual processing

Unified text-image encoding

Use Cases

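Research

Research paper retrieval

Handles research papers containing equations, charts, and data

Retrieves complex academic content effectively

Technical Documentation

Technical documentation management

Encodes code blocks, flowcharts, screenshots, and other technical documentation content

Improves retrieval efficiency for technical documentation

Business Applications

Product catalog retrieval

Represents product images, specifications, and price lists

Improves the e-commerce experience

Financial report analysis

Embeds trend charts, bar charts, and numerical data

Speeds up financial data analysis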

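🚀 Nomic Embed Multimodal 7B: An Advanced Visual Document Retrieval Model

nomic-embed-multimodal-7b is an advanced dense multimodal embedding model that performs strongly on visual document retrieval tasks:

High performance: Achieves 58.8 NDCG@5 on Vidore-v2, outperforming all other dense multimodal embedding models.
Unified text-image encoding: Encodes interleaved text and images directly, without complex preprocessing.
Advanced architecture: A 7-billion-parameter multimodal embedding model.
Fully open source: Model weights, training data, and code are all publicly available.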
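For readers unfamiliar with the NDCG@5 metric quoted above, the following is a minimal sketch of how it is commonly computed for a single query. It assumes linear gain and builds the ideal ordering only from the relevance labels of the returned list; the function names and the toy rankings are illustrative and not taken from the Vidore-v2 evaluation code. Benchmarks usually average this value over all queries and report it on a 0 to 100 scale.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item in the list
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical rankings: relevance label of each retrieved page, in rank order.
perfect = ndcg_at_k([1, 0, 0, 0, 0])  # relevant page ranked first
late = ndcg_at_k([0, 0, 1, 0, 0])     # relevant page ranked third
```

The later a relevant page appears in the top five, the smaller its logarithmically discounted contribution, so `late` scores lower than `perfect`.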

🚀 Quick Start

To use nomic-embed-multimodal-7b, install colpali from source:

pip install git+https://github.com/illuin-tech/colpali.git

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import BiQwen2_5, BiQwen2_5_Processor

model_name = "nomic-ai/nomic-embed-multimodal-7b"

model = BiQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = BiQwen2_5_Processor.from_pretrained(model_name)

# Example inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Preprocess the inputs and move them to the model's device
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass (no gradients needed for inference)
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score(list(torch.unbind(query_embeddings)), list(torch.unbind(image_embeddings)))
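Once scores are available, a typical next step is to pick the best-matching images for each query. The sketch below assumes that `scores` is a two-dimensional matrix with one row per query and one column per image, where a higher value means a better match, and that it has been converted to nested lists (for example with `scores.tolist()`). The helper name `top_k_per_query` and the numbers are hypothetical.

```python
def top_k_per_query(score_matrix, k=1):
    """Return, for each query row, the k (image_index, score) pairs with the highest scores."""
    results = []
    for row in score_matrix:
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:k])
    return results

# Hypothetical 2 x 2 score matrix standing in for `scores.tolist()`.
example_scores = [
    [0.12, 0.48],
    [0.33, 0.07],
]
best = top_k_per_query(example_scores, k=1)
```

Here `best[0]` holds the top image for the first query and `best[1]` the top image for the second.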

✨ Key Features

Performance

Model	Avg.	ESG Restaurant Human	Econ Macro Multi.	AXA Multi.	MIT Bio	ESG Restaurant Synth.	ESG Restaurant Synth. Multi.	MIT Bio Multi.	AXA	Econ Macro
ColNomic Embed Multimodal 7B	62.7	73.9	54.7	61.3	66.1	57.3	56.7	64.2	68.3	61.6
ColNomic Embed Multimodal 3B	61.2	65.8	55.4	61.0	63.5	56.6	57.2	62.5	68.8	60.2
T-Systems ColQwen2.5-3B	59.9	72.1	51.2	60.0	65.3	51.7	53.3	61.7	69.3	54.8
Nomic Embed Multimodal 7B	59.7	65.7	57.7	59.3	64.0	49.2	51.9	61.2	66.3	63.1
GME Qwen2 7B	59.0	65.8	56.2	55.4	64.0	54.3	56.7	55.1	60.7	62.9
Nomic Embed Multimodal 3B	58.8	59.8	57.5	58.8	62.5	49.4	49.4	58.6	69.6	63.5
Llama Index vdr-2b-multi-v1	58.4	63.1	52.8	61.0	60.6	50.3	51.2	56.9	68.8	61.2
Voyage Multimodal 3	55.0	56.1	55.0	59.5	56.4	47.2	46.2	51.5	64.1	58.8

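Model Architecture

Total parameters: 7B
Training approach: Fine-tuned from Qwen2.5-VL 7B Instruct
Architecture type: Vision-language model with unified text and image input processing
Key innovations:
- Same-source sampling to create harder in-batch negatives
- Hard negative mining with positive-aware techniques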
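To make the same-source sampling idea concrete, here is a minimal sketch of a batch builder in which every batch draws its examples from a single dataset source, so the in-batch negatives come from similar-looking documents. This is not Nomic's training code; the `source` field, the function name, and the toy corpus are assumptions made for illustration.

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size, seed=0):
    """Return batches whose examples all share one `source` value."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)

    batches = []
    for items in by_source.values():
        rng.shuffle(items)
        # Drop the incomplete tail so every batch has the same size.
        for start in range(0, len(items) - batch_size + 1, batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # mix sources across training steps, not within a batch
    return batches

# Hypothetical toy corpus with two sources.
toy = [{"id": i, "source": "reports" if i % 2 else "slides"} for i in range(10)]
batches = same_source_batches(toy, batch_size=2)
```

Dropping the incomplete tail is a design choice made here for simplicity; a real pipeline might pad or merge leftovers instead.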

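Integration with RAG Workflows

Nomic Embed Multimodal 7B integrates seamlessly into Retrieval-Augmented Generation (RAG) workflows:

Direct document embedding: Embed document page images directly, skipping OCR and complex processing.
Faster processing: Eliminating preprocessing steps enables faster indexing.
More complete information: A single embedding captures both textual and visual cues.
Simple implementation: The same API is used for text and images.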
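The retrieval half of such a workflow can be sketched with a small in-memory index. The example assumes each page has already been embedded into a single vector and uses cosine similarity for ranking; the `PageIndex` class, the page ids, and the three-dimensional vectors are all hypothetical stand-ins, and a production system would more likely use a vector database.

```python
import math

class PageIndex:
    """Minimal in-memory index mapping page ids to single-vector embeddings."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else list(vector)

    def add(self, page_id, embedding):
        """Store a unit-length copy of the page embedding."""
        self._ids.append(page_id)
        self._vectors.append(self._normalize(embedding))

    def search(self, query_embedding, k=3):
        """Return the k page ids with the highest cosine similarity to the query."""
        query = self._normalize(query_embedding)
        scored = [
            (sum(q * v for q, v in zip(query, vector)), page_id)
            for page_id, vector in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [page_id for _, page_id in scored[:k]]

# Hypothetical 3-dimensional embeddings; real ones come from the model.
index = PageIndex()
index.add("report.pdf#p1", [0.9, 0.1, 0.0])
index.add("report.pdf#p2", [0.0, 1.0, 0.0])
index.add("slides.pdf#p4", [0.1, 0.0, 0.9])
hits = index.search([1.0, 0.2, 0.0], k=2)
```

The page images behind `hits` would then be handed to a generator model as context for answering the query.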

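Training Details

Nomic Embed Multimodal 7B was developed with several key innovations:

Same-source sampling: Forcing each batch to be sampled from a single dataset source creates harder in-batch negatives and prevents the model from learning dataset artifacts.
Hard negative mining: An initial model retrieves the top-k nearest neighbors for each query, and these hard negatives are then incorporated into training.
Positive-aware hard negative mining: Techniques introduced in NV-Retriever are used to reduce false negatives.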
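One plausible form of positive-aware filtering is sketched below: candidates retrieved for a query are discarded when their score comes too close to the score of the labeled positive, since those are the likeliest false negatives. The source does not state which rule or values were used for this model, so the ratio of 0.95, the `top_k` value, the function name, and the scores are all illustrative assumptions.

```python
def filter_hard_negatives(positive_score, candidates, max_ratio=0.95, top_k=4):
    """Keep the top_k highest-scoring candidates that score below max_ratio * positive_score.

    candidates: list of (doc_id, score) pairs retrieved for one query,
    excluding the labeled positive.
    """
    threshold = max_ratio * positive_score
    kept = [(doc_id, score) for doc_id, score in candidates if score < threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Hypothetical retrieval scores for one query whose positive scored 0.80.
candidates = [("d1", 0.79), ("d2", 0.70), ("d3", 0.55), ("d4", 0.77), ("d5", 0.30)]
negatives = filter_hard_negatives(0.80, candidates, max_ratio=0.95, top_k=2)
```

In this toy case `d1` and `d4` are dropped as suspiciously close to the positive, and the two hardest remaining candidates are kept as negatives.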

🔧 Technical Details

Basic Model Information

Property	Details
Base model	Qwen/Qwen2.5-VL-7B-Instruct
Library name	peft
Dataset	nomic-ai/colpali-queries-mined-20250321-by-source
Supported languages	English, Italian, French, German, Spanish
Task type	Visual document retrieval
Tags	vidore, colpali, multimodal_embedding, multilingual_embedding, Text-to-Visual Document (T→VD) retrieval

📄 License

This project is released under the Apache 2.0 license.

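⚠️ Limitations

Performance may vary on documents with unconventional layouts or unusual visual elements.
Although multiple languages are supported, performance is strongest on English content.
Very large or complex documents may need to be split into smaller chunks.
Performance may degrade on documents containing handwriting or heavily stylized fonts.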
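For the point about splitting large documents, a minimal sketch is to break a long list of pages into fixed-size chunks and embed each chunk as its own batch. The helper name and the chunk size are assumptions; the right size depends on available memory and is not specified by the source.

```python
def chunk_pages(pages, chunk_size):
    """Split a sequence of pages into consecutive chunks of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pages[start:start + chunk_size] for start in range(0, len(pages), chunk_size)]

# Hypothetical 10-page document represented by its page numbers.
pages = list(range(1, 11))
chunks = chunk_pages(pages, chunk_size=4)
```

Each chunk keeps the original page order, so results can be mapped back to page positions in the source document.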

👥 Join the Nomic Community

Nomic Embed ecosystem: https://www.nomic.ai/embed
Website: https://nomic.ai
Twitter: https://twitter.com/nomic_ai
Discord: https://discord.gg/myY5YDR8z8

📚 Citation

If you find this model useful in your research or applications, please consider citing:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
      title={Unifying Multimodal Retrieval via Document Screenshot Embedding}, 
      author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
      year={2024},
      eprint={2406.11251},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2406.11251}, 
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}