🚀 MonoQwen2-VL-v0.1
MonoQwen2-VL-v0.1 is a multimodal reranker designed for visual document retrieval. It is fine-tuned from Qwen2-VL-2B, enabling it to accurately assess the relevance between images and queries. This model can effectively rerank candidate images, enhancing the precision of visual document retrieval.
🚀 Quick Start
✨ Features
- Multimodal Reranking: Fine-tuned from Qwen2-VL-2B using LoRA, it is optimized for asserting pointwise image-query relevance.
- Effective Scoring: During inference, it can generate a relevancy score by comparing the logits of "True" and "False" tokens, which can be used to rerank candidates or filter them.
- Trained on Specific Data: Trained on the ColPali train set with negatives mined using DSE.
📦 Installation
This example requires `peft` to be installed in your environment. You can install it using the following command:

```bash
pip install peft
```
💻 Usage Examples
Basic Usage
Below is a quick example to rerank a single image against a user query using this model:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor of the base model and the fine-tuned reranker.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
)

# Define the query and load the image to score.
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Build the relevance prompt and wrap it in a chat message with the image.
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply the chat template, then tokenize the text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run a forward pass and take the logits at the last position.
with torch.no_grad():
    outputs = model(**inputs)

logits_for_last_token = outputs.logits[:, -1, :]

# Compare the logits of the "True" and "False" tokens to obtain a relevance score.
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

# Extract and print the probabilities.
true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()
print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```
This example demonstrates how to use the model to assess the relevance of an image with respect to a query. It outputs the probability that the image is relevant ("True") or not relevant ("False").
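In practice, you typically score several candidate images against the same query and sort them by the "True" probability. Below is a minimal sketch of that loop, reusing the `processor` and `model` loaded above; the `score_image` helper and the `candidate_paths` file names are illustrative, not part of the model's API:

```python
def score_image(image: Image.Image, query: str) -> float:
    """Return P("True"), i.e. the estimated relevance of the image to the query."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        "answer True or False. The query is: {query}"
    ).format(query=query)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_id = processor.tokenizer.convert_tokens_to_ids("False")
    return torch.softmax(logits[:, [true_id, false_id]], dim=-1)[0, 0].item()

# Rerank a list of candidate images (hypothetical file names) for one query.
candidate_paths = ["page_1.png", "page_2.png", "page_3.png"]
scores = [(path, score_image(Image.open(path), query)) for path in candidate_paths]
for path, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}  {path}")
```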
⚠️ Important Note
If you prefer not to use `peft`, you can use `model.load_adapter` on the original Qwen2-VL-2B model to load the LoRA weights.
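For reference, a minimal sketch of that alternative (loading the base model first, then attaching this repository's LoRA adapter) could look like this:

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the original base model.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)

# Attach the MonoQwen2-VL-v0.1 LoRA weights on top of it.
model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
```

The rest of the usage example above stays unchanged.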
📚 Documentation
Performance Metrics
The model has been evaluated on the ViDoRe benchmark by retrieving the top 10 candidates with MrLight_dse-qwen2-2b-mrl-v1 and reranking them. The table below summarizes the resulting NDCG@5 scores:
| Dataset | MrLight_dse-qwen2-2b-mrl-v1 | MonoQwen2-VL-v0.1 reranking |
|---|---|---|
| vidore/arxivqa_test_subsampled | 85.6 | 89.0 |
| vidore/docvqa_test_subsampled | 57.1 | 59.7 |
| vidore/infovqa_test_subsampled | 88.1 | 93.2 |
| vidore/tabfquad_test_subsampled | 93.1 | 96.0 |
| vidore/shiftproject_test | 82.0 | 93.0 |
| vidore/syntheticDocQA_artificial_intelligence_test | 97.5 | 100.0 |
| vidore/syntheticDocQA_energy_test | 92.9 | 97.7 |
| vidore/syntheticDocQA_government_reports_test | 96.0 | 98.0 |
| vidore/syntheticDocQA_healthcare_industry_test | 96.4 | 99.3 |
| vidore/tatdqa_test | 69.4 | 79.0 |
| **Mean** | 85.8 | 90.5 |
📄 License
This LoRA model is licensed under the Apache 2.0 license.
🔧 Technical Details
MonoQwen2-VL-v0.1 is a multimodal reranker fine-tuned with LoRA from Qwen2-VL-2B. It uses the MonoT5 objective for asserting pointwise image-query relevance: given an image and a query fed into the prompt of the VLM, the model generates "True" if the image is relevant to the query and "False" otherwise. During inference, a relevancy score is obtained by comparing the logits of the two tokens, and this score can be used to rerank candidates or filter them.
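Stated as a formula (a restatement of the scoring code above, not notation from an official write-up), with $z_{\text{True}}$ and $z_{\text{False}}$ denoting the last-position logits of the two tokens for query $q$ and image document $d$:

$$
\text{score}(q, d) = P(\text{True} \mid q, d) = \frac{e^{z_{\text{True}}}}{e^{z_{\text{True}}} + e^{z_{\text{False}}}}
$$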
Citation
If you find the model useful, consider citing our work:
```bibtex
@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}
```