🚀 MonoQwen2-VL-v0.1
MonoQwen2-VL-v0.1 is a multimodal reranker designed for visual document retrieval. It is fine-tuned from Qwen2-VL-2B, enabling it to accurately assess the relevance between images and queries. This model can effectively rerank candidate images, enhancing the precision of visual document retrieval.
🚀 Quick Start
✨ Features
- Multimodal Reranking: Fine-tuned from Qwen2-VL-2B using LoRA, it is optimized for asserting pointwise image-query relevance.
- Effective Scoring: During inference, it can generate a relevancy score by comparing the logits of "True" and "False" tokens, which can be used to rerank candidates or filter them.
- Trained on Specific Data: Trained on the ColPali train set with negatives mined using DSE.
📦 Installation
This example requires `peft` to be installed in your environment. You can install it using the following command:

```bash
pip install peft
```
💻 Usage Examples
Basic Usage
Below is a quick example to rerank a single image against a user query using this model:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor of the base model and the fine-tuned reranker.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)

# Define the query and load the image to score.
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Build the relevance prompt and wrap it in a chat message with the image.
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply the chat template, then tokenize the text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run a forward pass and take the logits at the last position.
with torch.no_grad():
    outputs = model(**inputs)

logits_for_last_token = outputs.logits[:, -1, :]

# Compare the logits of the "True" and "False" tokens to obtain a relevance score.
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

# Extract and print the probabilities.
true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()
print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```
This example demonstrates how to use the model to assess the relevance of an image with respect to a query. It outputs the probability that the image is relevant ("True") or not relevant ("False").
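In practice, you typically score several candidate images against the same query and sort them by the "True" probability. Below is a minimal sketch of that loop, reusing the `processor` and `model` loaded above; the `score_image` helper and the `candidate_paths` file names are illustrative, not part of the model's API:

```python
def score_image(image: Image.Image, query: str) -> float:
    """Return P("True"), i.e. the estimated relevance of the image to the query."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        "answer True or False. The query is: {query}"
    ).format(query=query)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_id = processor.tokenizer.convert_tokens_to_ids("False")
    return torch.softmax(logits[:, [true_id, false_id]], dim=-1)[0, 0].item()

# Rerank a list of candidate images (hypothetical file names) for one query.
candidate_paths = ["page_1.png", "page_2.png", "page_3.png"]
scores = [(path, score_image(Image.open(path), query)) for path in candidate_paths]
for path, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}  {path}")
```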
⚠️ Important Note
If you prefer not to use `peft`, you can use `model.load_adapter` on the original Qwen2-VL-2B model to load the LoRA weights.
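For reference, a minimal sketch of that alternative (loading the base model first, then attaching this repository's LoRA adapter) could look like this:

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the original base model.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)

# Attach the MonoQwen2-VL-v0.1 LoRA weights on top of it.
model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
```

The rest of the usage example above stays unchanged.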
📚 Documentation
Performance Metrics
The model has been evaluated on the ViDoRe benchmark by retrieving the top 10 candidates with MrLight_dse-qwen2-2b-mrl-v1 and reranking them. The table below summarizes the resulting NDCG@5 scores:
| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
|---|---|---|
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| **Mean** | 85.8 | 90.5 |
📄 License
This LoRA model is licensed under the Apache 2.0 license.
🔧 Technical Details
MonoQwen2-VL-v0.1 is a multimodal reranker fine-tuned with LoRA from Qwen2-VL-2B. It uses the MonoT5 objective for asserting pointwise image-query relevance: given an image and a query fed into the prompt of the VLM, the model generates "True" if the image is relevant to the query and "False" otherwise. During inference, a relevancy score is obtained by comparing the logits of the two tokens, and this score can be used to rerank candidates or filter them.
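Stated as a formula (a restatement of the scoring code above, not notation from an official write-up), with $z_{\text{True}}$ and $z_{\text{False}}$ denoting the last-position logits of the two tokens for query $q$ and image document $d$:

$$
\text{score}(q, d) = P(\text{True} \mid q, d) = \frac{e^{z_{\text{True}}}}{e^{z_{\text{True}}} + e^{z_{\text{False}}}}
$$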
Citation
If you find the model useful, consider citing our work:
```bibtex
@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}
```