🚀 MonoQwen2-VL-v0.1
MonoQwen2-VL-v0.1 is a multimodal reranker fine-tuned from Qwen2-VL-2B with LoRA. It is trained with the MonoT5 objective to estimate the pointwise relevance of an image to a query: the image and the query are placed in the prompt of the vision-language model (VLM), and the model generates "True" if the image is relevant to the query and "False" otherwise. At inference time, a relevance score is obtained by comparing the log-probabilities of these two tokens. This score can be used to rerank the candidates produced by a first-stage retriever (such as DSE or ColPali), or to filter them with a threshold.
🚀 Quick Start
Below is a quick example of using the model to rerank a single image against a user query:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor of the base model and the LoRA-finetuned reranker
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)

# Define the query and load the image to score
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Build the MonoT5-style prompt: the model must answer "True" or "False"
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)

# Wrap the image and the prompt in a chat message and apply the chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run a forward pass and keep the logits of the next token to be generated
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Compare the logits of the "True" and "False" tokens to get a relevance score
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```
This example shows how to use the model to evaluate the relevance of an image to a query. It outputs the probability that the image is relevant ("True") or not relevant ("False").
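In practice the reranker is applied to the candidates returned by a first-stage retriever. The following is a minimal sketch of that use, reusing `torch`, `Image`, `processor`, `model`, and `query` from the example above; the `score_image` helper, the `candidate_images` list, the file paths, and the 0.5 threshold are illustrative placeholders, not part of the model's API:

```python
def score_image(image, query):
    """Illustrative helper: return the probability that `image` is relevant to `query`,
    using the same True/False scoring logic as the example above."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        "answer True or False. The query is: {query}"
    ).format(query=query)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
    probs = torch.softmax(logits[:, [true_token_id, false_token_id]], dim=-1)
    return probs[0, 0].item()

# Hypothetical candidates returned by a first-stage retriever (e.g. DSE or ColPali)
candidate_images = [Image.open(p) for p in ["page_1.png", "page_2.png", "page_3.png"]]

# Rerank the candidates by their "True" probability, highest first
scores = [score_image(img, query) for img in candidate_images]
reranked = sorted(zip(candidate_images, scores), key=lambda pair: pair[1], reverse=True)

# Optionally, drop candidates below a relevance threshold
threshold = 0.5
filtered = [(img, score) for img, score in reranked if score >= threshold]
```

Each candidate is scored with an independent forward pass, so the cost grows linearly with the number of candidates to rerank.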
⚠️ Important Note
This example requires `peft` to be installed in your environment (`pip install peft`). If you do not want to use `peft` this way, you can load the original Qwen2-VL-2B model and use `model.load_adapter` to attach this adapter.
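A minimal sketch of that alternative loading path, assuming the LoRA weights in lightonai/MonoQwen2-VL-v0.1 can be attached to the base checkpoint via `load_adapter` (an illustration, not an officially documented recipe):

```python
from transformers import Qwen2VLForConditionalGeneration

# Load the original base model, then attach the MonoQwen2-VL LoRA adapter on top of it
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)
model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
```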
Performance Metrics
The model was evaluated on the ViDoRe benchmark, using MrLight_dse-qwen2-2b-mrl-v1 to retrieve 10 candidates that are then reranked. The table below summarizes the ndcg@5 scores:
| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
| --- | --- | --- |
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| Mean | 85.8 | 90.5 |
📄 License
This LoRA model is released under the Apache 2.0 license.
Citation
If you find this model useful, please consider citing our work:
```bibtex
@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}
```