colqwen2-v1.0-hf: Open-Source Visual Retrieval Model for Multi-Vector Text and Image Representations

Colqwen2 V1.0 Hf

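Developed by vidore

A visual retrieval model based on Qwen2-VL-2B-Instruct and the ColBERT strategy that generates multi-vector representations of text and images.

Visual Document Retrieval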

Transformers

Language: English | License: Apache-2.0 | Tags: #VisualDocumentRetrieval #MultiVectorRepresentation #PDFParsing

Downloads: 61

Released: 2/11/2025

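Model Overview

ColQwen2 is a vision language model designed to index documents from their visual features. It extends Qwen2-VL-2B and uses a ColBERT-style multi-vector representation strategy, which makes it suited to efficient document retrieval.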

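Model Features

Multi-vector representations

Uses the ColBERT strategy to generate multi-vector representations of text and images, improving retrieval accuracy.

Vision-language fusion

Combines visual and language features to enable cross-modal document retrieval.

Efficient retrieval

Optimizes retrieval efficiency through a late-interaction mechanism.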
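
To make the late-interaction idea above concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is an illustration under stated assumptions, not the model's actual implementation: the helper names `dot` and `maxsim_score` and the toy 2-dimensional vectors are hypothetical, and real embeddings are much larger tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """ColBERT-style late-interaction score (hypothetical helper).

    For each query token vector, take its best dot-product match among the
    page's vectors, then sum those maxima over all query tokens.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy embeddings with made-up values, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5]]

score_a = maxsim_score(query, page_a)  # each query vector finds an exact match
score_b = maxsim_score(query, page_b)  # only a partial match is available
```

In this toy case `page_a` scores higher than `page_b`, because every query vector has a closer match among its vectors.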

Model Capabilities

Document visual feature extraction

Cross-modal retrieval

Text-image matching

Multi-vector representation generation

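Use Cases

Document management

Enterprise document retrieval

Quickly find specific information in internal company documents

Improve the efficiency and accuracy of document retrieval

Academic literature search

Locate relevant content across large collections of PDF papers

Speed up the research process

Knowledge management

Knowledge base construction

Provide efficient retrieval for knowledge base systems

Improve the experience of accessing knowledge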

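🚀 ColQwen2: A Visual Retriever Based on Qwen2-VL-2B-Instruct with the ColBERT Strategy

ColQwen2 is a model built on a novel architecture and training strategy based on vision language models (VLMs) that efficiently indexes documents from their visual features. It is an extension of Qwen2-VL-2B that generates ColBERT-style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository.

The HuggingFace transformers 🤗 implementation was contributed by Tony Wu (@tonywu71) and Yoni Gozlan (@yonigozlan).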

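🚀 Quick Start

⚠️ Important

Experimental: before using this model, wait until https://github.com/huggingface/transformers/pull/35778 is merged!

⚠️ Important

This version of ColQwen2 should be loaded with transformers 🤗, not with colpali-engine. It was converted from the vidore/colqwen2-v1.0-merged checkpoint using the convert_colqwen2_weights_to_hf.py script.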
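
Because this checkpoint depends on ColQwen2 support being present in the installed transformers package, a quick availability check can be useful before loading. The sketch below is one possible way to do that. The helper name `colqwen2_available` is hypothetical, and the check only tests whether the two class names used in the usage example are exposed by the package.

```python
import importlib


def colqwen2_available() -> bool:
    """Return True if the installed transformers package exposes the ColQwen2 classes."""
    try:
        transformers = importlib.import_module("transformers")
    except ImportError:
        # transformers is not installed at all
        return False
    return all(
        hasattr(transformers, name)
        for name in ("ColQwen2ForRetrieval", "ColQwen2Processor")
    )


if not colqwen2_available():
    print("ColQwen2 classes not found; a newer transformers build may be needed.")
```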

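✨ Key Features

ColQwen2 builds on a novel architecture and training strategy based on vision language models (VLMs) to index documents efficiently from their visual features. It extends Qwen2-VL-2B and generates ColBERT-style multi-vector representations of text and images.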

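📚 Documentation

Model Description

Read the transformers 🤗 model card: https://huggingface.co/docs/transformers/en/model_doc/colqwen2.

Model Training

Dataset

Our training dataset consists of 127,460 query-page pairs, drawn from the training sets of publicly available academic datasets (63%) and a synthetic dataset (37%). The synthetic dataset is made up of pages from web-crawled PDF documents, augmented with pseudo-questions generated by a VLM (Claude-3 Sonnet). The training set is fully English by design, which makes it possible to study zero-shot generalization to non-English languages. We explicitly verified that no multi-page PDF document is used in both ViDoRe and the training set, to prevent evaluation contamination. A validation set is created from 2% of the samples to tune hyperparameters.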
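
As a rough worked example of the proportions above, the snippet below converts the stated percentages into approximate pair counts. The published percentages are rounded, so these counts are estimates only, and taking the 2% validation share from the full 127,460 pairs is an assumption.

```python
TOTAL_PAIRS = 127_460      # query-page pairs, as stated in the text
ACADEMIC_SHARE = 0.63      # training sets of public academic datasets
SYNTHETIC_SHARE = 0.37     # web-crawled PDF pages with generated pseudo-questions
VALIDATION_SHARE = 0.02    # held out for hyperparameter tuning (assumed of the total)

# Rounded estimates; the source gives percentages, not exact counts.
academic_pairs = round(TOTAL_PAIRS * ACADEMIC_SHARE)
synthetic_pairs = round(TOTAL_PAIRS * SYNTHETIC_SHARE)
validation_pairs = round(TOTAL_PAIRS * VALIDATION_SHARE)

print(academic_pairs, synthetic_pairs, validation_pairs)
```

Under these assumptions the split works out to roughly 80,300 academic pairs, 47,160 synthetic pairs, and about 2,549 validation samples.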

💻 Usage Examples

Basic Usage

import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available


model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Your inputs (replace dummy images with screenshots of your documents)
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
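
Once scores are available, a typical next step is to rank pages for each query. The sketch below assumes the score matrix has one row per query and one column per image, and that it has been converted to nested Python lists, for example with `scores.float().cpu().tolist()`. The helper `top_k_pages` and the numbers in `example_scores` are hypothetical and for illustration only.

```python
from typing import List, Tuple


def top_k_pages(score_rows: List[List[float]], k: int = 1) -> List[List[Tuple[int, float]]]:
    """For each query row, return the k best (page_index, score) pairs, best first."""
    ranked = []
    for row in score_rows:
        # Sort page indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append([(i, row[i]) for i in order[:k]])
    return ranked


# Made-up score matrix: 2 queries x 3 pages.
example_scores = [
    [12.5, 18.0, 9.25],
    [20.0, 7.5, 15.0],
]
best = top_k_pages(example_scores, k=2)
```

Here the first query ranks page 1 ahead of page 0, and the second query ranks page 0 ahead of page 2.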

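🔧 Technical Details

Limitations

Focus: The model primarily targets PDF-type documents and high-resource languages, which may limit its generalization to other document types or to under-represented languages.
Support: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism, which may require engineering effort to adapt to widely used vector retrieval frameworks that lack native multi-vector support.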
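
One possible workaround for frameworks without multi-vector support is to pool each page's vectors into a single vector for a first-stage candidate search, then rerank the candidates with late interaction. This approach is not described in the source; it is an assumption for illustration, and pooling discards token-level detail. The helper names below are hypothetical.

```python
import math
from typing import List


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average a set of token or patch vectors into one L2-normalized vector."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # guard against a zero vector
    return [x / norm for x in mean]


def cosine_of_unit_vectors(u: List[float], v: List[float]) -> float:
    """Cosine similarity, valid because both inputs are already unit length."""
    return sum(a * b for a, b in zip(u, v))


# Toy multi-vector embeddings with made-up values.
page_multi = [[1.0, 0.0], [0.0, 1.0]]
query_multi = [[1.0, 0.0]]

page_single = mean_pool(page_multi)    # one vector that a single-vector index can store
query_single = mean_pool(query_multi)
coarse_score = cosine_of_unit_vectors(query_single, page_single)
```

The coarse score is only meant to narrow the candidate set; the final ordering would still come from the multi-vector scores.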

📄 License

ColQwen2's vision language backbone (Qwen2-VL) is released under the apache-2.0 license. ColQwen2 inherits this apache-2.0 license.

📞 Contact

Manuel Faysse: manuel.faysse@illuin.tech
Hugues Sibille: hugues.sibille@illuin.tech
Tony Wu: tony.wu@illuin.tech

📚 Citation

If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}