🚀 MonoQwen2-VL-v0.1
MonoQwen2-VL-v0.1 is a multimodal reranker fine-tuned from Qwen2-VL-2B with LoRA. It is trained with the MonoT5 objective to estimate the pointwise relevance of an image to a query: the image and the query are placed in the prompt of the vision-language model (VLM), and the model generates "True" if the image is relevant to the query and "False" otherwise. At inference time, a relevance score is obtained by comparing the log-probabilities of these two tokens. This score can be used to rerank the candidates produced by a first-stage retriever (such as DSE or ColPali), or to filter them with a threshold.
🚀 Quick Start
Below is a quick example of using the model to rerank a single image against a user query:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor of the base model and the LoRA-finetuned reranker
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)

# Define the query and load the image to score
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Build the MonoT5-style prompt: the model must answer "True" or "False"
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)

# Wrap the image and the prompt in a chat message and apply the chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run a forward pass and keep the logits of the next token to be generated
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Compare the logits of the "True" and "False" tokens to get a relevance score
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```
This example shows how to use the model to evaluate the relevance of an image to a query. It outputs the probability that the image is relevant ("True") or not relevant ("False").
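In practice the reranker is applied to the candidates returned by a first-stage retriever. The following is a minimal sketch of that use, reusing `torch`, `Image`, `processor`, `model`, and `query` from the example above; the `score_image` helper, the `candidate_images` list, the file paths, and the 0.5 threshold are illustrative placeholders, not part of the model's API:

```python
def score_image(image, query):
    """Illustrative helper: return the probability that `image` is relevant to `query`,
    using the same True/False scoring logic as the example above."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        "answer True or False. The query is: {query}"
    ).format(query=query)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
    probs = torch.softmax(logits[:, [true_token_id, false_token_id]], dim=-1)
    return probs[0, 0].item()

# Hypothetical candidates returned by a first-stage retriever (e.g. DSE or ColPali)
candidate_images = [Image.open(p) for p in ["page_1.png", "page_2.png", "page_3.png"]]

# Rerank the candidates by their "True" probability, highest first
scores = [score_image(img, query) for img in candidate_images]
reranked = sorted(zip(candidate_images, scores), key=lambda pair: pair[1], reverse=True)

# Optionally, drop candidates below a relevance threshold
threshold = 0.5
filtered = [(img, score) for img, score in reranked if score >= threshold]
```

Each candidate is scored with an independent forward pass, so the cost grows linearly with the number of candidates to rerank.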
⚠️ Important Note
This example requires `peft` to be installed in your environment (`pip install peft`). If you do not want to use `peft` this way, you can load the original Qwen2-VL-2B model and use `model.load_adapter` to attach this adapter.
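A minimal sketch of that alternative loading path, assuming the LoRA weights in lightonai/MonoQwen2-VL-v0.1 can be attached to the base checkpoint via `load_adapter` (an illustration, not an officially documented recipe):

```python
from transformers import Qwen2VLForConditionalGeneration

# Load the original base model, then attach the MonoQwen2-VL LoRA adapter on top of it
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)
model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
```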
Performance Metrics
The model was evaluated on the ViDoRe benchmark, using MrLight_dse-qwen2-2b-mrl-v1 to retrieve 10 candidates that are then reranked. The table below summarizes the ndcg@5 scores:
| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
| --- | --- | --- |
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| Mean | 85.8 | 90.5 |
📄 License
This LoRA model is released under the Apache 2.0 license.
Citation
If you find this model useful, please consider citing our work:
```bibtex
@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}
```