MonoQwen2-VL-v0.1: Open-Source Multimodal Reranker for Assessing Image-Query Relevance

Developed by lightonai

MonoQwen2-VL-v0.1 is a multimodal reranking model fine-tuned from Qwen2-VL-2B. It is used to assess how relevant an image is to a query.

Image-to-Text Open Source License: Apache-2.0 #Multimodal reranking #Visual document retrieval #LoRA fine-tuning

Downloads 547

Release Date: 10/25/2024

Model Overview

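The model is fine-tuned with LoRA to make pointwise relevance judgments on image-query pairs: it generates a True or False response, from which a relevance score can be computed. It is suited to reranking or filtering retrieval results.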

Model Features

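Multimodal reranking

Assesses the relevance of an image to a text query by generating a True or False response.

LoRA fine-tuning

Fine-tuned efficiently from Qwen2-VL-2B with LoRA to optimize the relevance-judgment task.

High performance

Performs well on the ViDoRe benchmark, raising the average ndcg@5 of retrieved results from 85.8 to 90.5.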

Model Capabilities

Image-text relevance assessment

Reranking of multimodal retrieval results

True/False response generation

Use Cases

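Information retrieval

Document retrieval reranking

Reranks the candidates returned by a first-stage retriever (such as DSE or ColPali) to improve retrieval quality.

On the ViDoRe benchmark, ndcg@5 improved by 4.7 points on average (from 85.8 to 90.5).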
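The two-stage flow described above can be sketched in plain Python. This is only an illustration: `rerank`, `toy_score`, and `first_stage` are hypothetical names, and `toy_score` is a word-overlap stand-in for the real model score so that the example runs without loading any weights. In practice the candidates would be page images and `score_fn` would return the model's "True" probability.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rerank(
    query: str,
    candidates: Sequence[T],
    score_fn: Callable[[str, T], float],
    top_k: int = 5,
) -> List[Tuple[T, float]]:
    """Score every (query, candidate) pair and return the top_k by score."""
    # Pointwise reranking: each candidate is scored independently of the others.
    scored = [(cand, score_fn(query, cand)) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(query: str, page: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the text."""
    words = query.lower().split()
    return sum(w in page.lower() for w in words) / len(words)


# Hypothetical first-stage candidates, represented here as text for simplicity.
first_stage = ["intro to bm25", "colpali late interaction retrieval", "what is colpali"]
print(rerank("what is colpali", first_stage, toy_score, top_k=2))
```

Because each pair is scored independently, the cost grows linearly with the number of candidates, which is why a reranker is usually applied only to a short list from the first stage.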

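Image filtering

Image relevance filtering

By setting a threshold on the relevance score, images unrelated to the query can be filtered out, improving retrieval efficiency.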
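Threshold filtering can be sketched as follows. The function name, the 0.5 default, and the example scores are all assumptions made for illustration; a suitable threshold depends on the data and should be tuned on a validation set.

```python
from typing import Dict, List


def filter_by_threshold(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep the ids whose relevance probability is at least `threshold`."""
    return [doc_id for doc_id, prob in scores.items() if prob >= threshold]


# Hypothetical "True" probabilities for three page images.
page_scores = {"page_1.png": 0.92, "page_2.png": 0.08, "page_3.png": 0.55}
kept = filter_by_threshold(page_scores, threshold=0.5)
print(kept)  # ['page_1.png', 'page_3.png']
```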

🚀 MonoQwen2-VL-v0.1

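MonoQwen2-VL-v0.1 is a multimodal reranker fine-tuned with LoRA from Qwen2-VL-2B. The model is optimized with the MonoT5 objective to assess the pointwise relevance of an image to a query. That is, given an image-query pair in a vision-language model (VLM) prompt, it is trained to generate "True" if the image is relevant to the query and "False" otherwise. At inference time, a relevance score is obtained by comparing the logits of the two tokens. This score can be used to rerank the candidates produced by a first-stage retriever (such as DSE or ColPali) or to filter them with a threshold.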

The model was trained on the ColPali training set, with negatives mined using DSE.
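The logit comparison mentioned above amounts to a softmax over two numbers. The sketch below shows that arithmetic in isolation, with a hypothetical helper name and made-up logit values. It is not the model's code, only the same calculation on plain floats.

```python
import math


def relevance_probability(true_logit: float, false_logit: float) -> float:
    """Softmax over the two logits, returning the probability of "True"."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(true_logit, false_logit)
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


# Made-up logits: equal logits give 0.5, a higher "True" logit gives more than 0.5.
print(relevance_probability(2.0, 2.0))
print(relevance_probability(3.0, 1.0))
```

A two-way softmax is equivalent to applying a sigmoid to the difference `true_logit - false_logit`, so only the gap between the two logits matters, not their absolute values.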

🚀 Quick Start

Below is a simple example that uses the model to score a single image against a user query.

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load processor and model
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
    # attn_implementation="flash_attention_2",
    # torch_dtype=torch.bfloat16,
)

# Define query and load image
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Construct the prompt and prepare input
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply chat template and tokenize
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to the device chosen by device_map="auto" rather than hard-coding "cuda"
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# Run inference to obtain logits
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Convert tokens and calculate relevance score
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

# Extract and display probabilities
true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")

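This example shows how to use the model to assess the relevance of an image to a query. It prints the probability that the image is relevant ("True") and the probability that it is not ("False").

⚠️ Important Note

This example requires peft to be installed in your environment (pip install peft). If you prefer not to use peft, you can instead call model.load_adapter on the original Qwen2-VL-2B model.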

📚 Documentation

Performance Metrics

The model was evaluated on the ViDoRe benchmark. Specifically, MrLight_dse-qwen2-2b-mrl-v1 retrieves 10 elements per query, which are then reranked. The table below summarizes the ndcg@5 scores.

Dataset	MrLight_dse-qwen2-2b-mrl-v1	MonoQwen2-VL-v0.1 reranking
vidore/arxivqa_test_subsampled	85.6	89.0
vidore/docvqa_test_subsampled	57.1	59.7
vidore/infovqa_test_subsampled	88.1	93.2
vidore/tabfquad_test_subsampled	93.1	96.0
vidore/shiftproject_test	82.0	93.0
vidore/syntheticDocQA_artificial_intelligence_test	97.5	100.0
vidore/syntheticDocQA_energy_test	92.9	97.7
vidore/syntheticDocQA_government_reports_test	96.0	98.0
vidore/syntheticDocQA_healthcare_industry_test	96.4	99.3
vidore/tatdqa_test	69.4	79.0
Average	85.8	90.5
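For readers unfamiliar with the metric, here is a minimal sketch of nDCG@k on a single ranked list. It assumes the common linear-gain, log2-discount formulation and computes the ideal ordering from the retrieved list only. The benchmark's own implementation may differ in these details, and the relevance lists below are invented for illustration. Note that the table reports scores scaled to 0-100, while this function returns a value between 0 and 1.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items of a ranked list."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Invented example: one relevant page among 10 candidates.
before = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # relevant page ranked third
after = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # relevant page moved to the top
print(ndcg_at_k(before), ndcg_at_k(after))  # 0.5 1.0
```

Moving the single relevant page from rank 3 to rank 1 doubles the score in this toy case, which illustrates why reranking a fixed candidate list can raise ndcg@5 without retrieving anything new.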

📄 License

This LoRA model is released under the Apache 2.0 license.

Citation

If you find this model useful, please consider citing it as follows.

@misc{MonoQwen,
  title={MonoQwen: Visual Document Reranking},
  author={Chaffin, Antoine and Lac, Aurélien},
  url={https://huggingface.co/lightonai/MonoQwen2-VL-v0.1},
  year={2024}
}