🚀 MonoQwen2-VL-v0.1
MonoQwen2-VL-v0.1 is a multimodal reranker finetuned from Qwen2-VL-2B with LoRA. It is optimized for pointwise image–query relevance using the MonoT5 objective: given an image and a query in the prompt of the vision-language model (VLM), the model generates "True" if the image is relevant to the query and "False" otherwise. At inference time, a relevance score is obtained by comparing the log-probabilities of these two tokens; this score can then be used to rerank the candidates produced by a first-stage retriever (such as DSE or ColPali), or to filter them with a threshold.
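Concretely, writing $\ell_{\text{True}}$ and $\ell_{\text{False}}$ for the logits of the two tokens at the final position, the relevance score computed in the snippet below is the softmax-normalized probability of "True":

$$
\mathrm{score}(q, d) = \frac{e^{\ell_{\text{True}}}}{e^{\ell_{\text{True}}} + e^{\ell_{\text{False}}}}
$$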
🚀 Quick Start
Here is a quick example of using the model to rerank a single image against a user query:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor from the base model and the LoRA-finetuned reranker
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)

# Define the query and load the image to score
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Build the MonoT5-style prompt asking for a True/False relevance judgment
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply the chat template and tokenize the text and image inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run a forward pass and keep the logits of the next (first generated) token
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Compare the probabilities of the "True" and "False" tokens
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```
This example shows how to assess the relevance of an image to a query: it outputs the probabilities that the image is relevant ("True") or irrelevant ("False").
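Beyond a single image, the same scoring can be wrapped in a small helper to rerank the candidates returned by a first-stage retriever, as described above. A minimal sketch, reusing the imports, `model`, and `processor` from the snippet above (the candidate file names and the 0.5 filtering threshold are illustrative, not part of the model card):

```python
def relevance(query: str, image: Image.Image) -> float:
    """Return the probability that the image is relevant to the query ("True")."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        f"answer True or False. The query is: {query}"
    )
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_id = processor.tokenizer.convert_tokens_to_ids("False")
    return torch.softmax(logits[:, [true_id, false_id]], dim=-1)[0, 0].item()

# Hypothetical candidates from a first-stage retriever such as DSE or ColPali
candidates = [Image.open(path) for path in ["page_1.png", "page_2.png", "page_3.png"]]
scores = [relevance("What is ColPali?", image) for image in candidates]

# Rerank by descending relevance; optionally filter with a threshold (0.5 is arbitrary)
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
kept = [(image, score) for image, score in reranked if score > 0.5]
```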
⚠️ Important Note
These examples require `peft` to be installed in your environment (`pip install peft`). If you do not want to use `peft`, you can instead load the original Qwen2-VL-2B model and attach the LoRA weights with `model.load_adapter`.
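A minimal sketch of that alternative route (assuming the checkpoint repository ships the adapter weights in the `peft` format that `load_adapter` expects):

```python
from transformers import Qwen2VLForConditionalGeneration

# Load the original base model...
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)
# ...then attach the MonoQwen2-VL-v0.1 LoRA weights on top of it
model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
```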
Performance Metrics
The model was evaluated on the ViDoRe benchmark: MrLight_dse-qwen2-2b-mrl-v1 retrieves the top 10 candidates, which MonoQwen2-VL-v0.1 then reranks. The table below reports the resulting NDCG@5 scores:

| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
|---|---|---|
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| **Mean** | 85.8 | 90.5 |
📄 License
This LoRA model is licensed under the Apache 2.0 license.
Citation
If you find this model useful, please consider citing our work:

```bibtex
@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}
```