🚀 MonoQwen2-VL-v0.1
MonoQwen2-VL-v0.1 is a multimodal reranker finetuned from Qwen2-VL-2B with LoRA. It is optimized for pointwise image–query relevance using the MonoT5 objective: given an image and a query in the prompt of the vision-language model (VLM), the model generates "True" if the image is relevant to the query and "False" otherwise. At inference time, a relevance score is obtained by comparing the log-probabilities of these two tokens; this score can then be used to rerank the candidates produced by a first-stage retriever (such as DSE or ColPali), or to filter them with a threshold.
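Concretely, writing $\ell_{\text{True}}$ and $\ell_{\text{False}}$ for the logits of the two tokens at the final position, the relevance score computed in the snippet below is the softmax-normalized probability of "True":

$$
\mathrm{score}(q, d) = \frac{e^{\ell_{\text{True}}}}{e^{\ell_{\text{True}}} + e^{\ell_{\text{False}}}}
$$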
🚀 Quick Start
Here is a quick example of using the model to rerank a single image against a user query:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor from the base model and the LoRA-finetuned reranker
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)

# Define the query and load the image to score
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Build the MonoT5-style prompt asking for a True/False relevance judgment
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply the chat template and tokenize the text and image inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run a forward pass and keep the logits of the next (first generated) token
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Compare the probabilities of the "True" and "False" tokens
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```
This example shows how to assess the relevance of an image to a query: it outputs the probabilities that the image is relevant ("True") or irrelevant ("False").
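Beyond a single image, the same scoring can be wrapped in a small helper to rerank the candidates returned by a first-stage retriever, as described above. A minimal sketch, reusing the imports, `model`, and `processor` from the snippet above (the candidate file names and the 0.5 filtering threshold are illustrative, not part of the model card):

```python
def relevance(query: str, image: Image.Image) -> float:
    """Return the probability that the image is relevant to the query ("True")."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        f"answer True or False. The query is: {query}"
    )
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_id = processor.tokenizer.convert_tokens_to_ids("False")
    return torch.softmax(logits[:, [true_id, false_id]], dim=-1)[0, 0].item()

# Hypothetical candidates from a first-stage retriever such as DSE or ColPali
candidates = [Image.open(path) for path in ["page_1.png", "page_2.png", "page_3.png"]]
scores = [relevance("What is ColPali?", image) for image in candidates]

# Rerank by descending relevance; optionally filter with a threshold (0.5 is arbitrary)
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
kept = [(image, score) for image, score in reranked if score > 0.5]
```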
⚠️ Important Note
These examples require `peft` to be installed in your environment (`pip install peft`). If you do not want to use `peft`, you can instead load the original Qwen2-VL-2B model and attach the LoRA weights with `model.load_adapter`.
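A minimal sketch of that alternative route (assuming the checkpoint repository ships the adapter weights in the `peft` format that `load_adapter` expects):

```python
from transformers import Qwen2VLForConditionalGeneration

# Load the original base model...
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)
# ...then attach the MonoQwen2-VL-v0.1 LoRA weights on top of it
model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
```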
Performance Metrics
The model was evaluated on the ViDoRe benchmark: MrLight_dse-qwen2-2b-mrl-v1 retrieves the top 10 candidates, which MonoQwen2-VL-v0.1 then reranks. The table below reports the resulting NDCG@5 scores:

| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
|---|---|---|
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| **Mean** | 85.8 | 90.5 |
📄 License
This LoRA model is licensed under the Apache 2.0 license.
Citation
If you find this model useful, please consider citing our work:

```bibtex
@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}
```