VLM2Vec-V2.0 open-source model: strong embedding capabilities for multimodal data such as video and images

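VLM2Vec-V2.0

Developed by VLM2Vec

VLM2Vec-V2 is a model for massive multimodal embedding tasks. By training a vision-language model, it provides stronger embedding capabilities for multimodal data such as videos, images, and visual documents.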

Multimodal fusion

Transformers

Language: English
Open-source license: Apache-2.0
Tags: #multimodal-embedding #video-understanding #large-scale-pretraining

Downloads: 2,527

Release date: 4/30/2025

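Model overview

VLM2Vec-V2 is a vision-language model specialized in generating embedding representations for multimodal data such as videos, images, and visual documents. It performs well on the Massive Multimodal Embedding Benchmark (MMEB) and applies to a wide range of tasks.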

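Model features

Multimodal embedding

Generates high-quality embedding representations for data in several modalities, including videos, images, and visual documents.

High performance

Achieves strong experimental results on the Massive Multimodal Embedding Benchmark (MMEB).

Broad applicability

Applies to a variety of multimodal tasks such as video understanding and image retrieval.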

Model capabilities

Video embedding

Image embedding

Visual document embedding

Multimodal similarity computation
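As a rough illustration of what multimodal similarity computation involves, the sketch below computes the cosine similarity between two embedding vectors in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; they are not part of the VLM2Vec codebase, and real embeddings have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, for illustration only.
video_embedding = [1.0, 2.0, 2.0]
text_embedding = [2.0, 4.0, 4.0]
print(cosine_similarity(video_embedding, text_embedding))
```

Vectors that point in the same direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.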

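Use cases

Video understanding

Video description

Uses video embeddings to associate a video with a description of its content.

Can identify an accurate description of a video, for example: "A man in a gray sweater plays fetch with his dog in the snow."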

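Image retrieval

Image similarity computation

Computes the similarity between an image and a text description.

Produces accurate similarity scores between images and text descriptions.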
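To make the retrieval use case concrete, the following sketch ranks a few candidate images against a text query by cosine similarity of their embeddings. The helper `rank_candidates`, the file names, and the 2-dimensional vectors are all hypothetical; in practice the embeddings would come from the model rather than being written by hand.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings: one text query and three candidate images.
query_embedding = [1.0, 0.0]
image_embeddings = {
    "dog_in_snow.jpg": [0.9, 0.1],
    "city_street.jpg": [0.1, 0.9],
    "beach.jpg": [0.5, 0.5],
}

for name, score in rank_candidates(query_embedding, image_embeddings, top_k=2):
    print(f"{name}: {score:.3f}")
```

For a large collection, the candidate embeddings would normally be computed once and stored, so that only the query needs to be embedded at search time.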

🚀 VLM2Vec-V2

VLM2Vec-V2 is a vision-language model for multimodal embedding tasks. It is trained on diverse datasets and achieves strong performance.

🚀 Quick start

What's new

[2025.07] Technical report released.
[2025.05] Initial release of MMEB-V2/VLM2Vec-V2.

Experimental results

Results on MMEB-V2 are provided. A detailed leaderboard is also available.

How to use VLM2Vec

A demo example is available on GitHub.

# The `src` package comes from the VLM2Vec GitHub repository; run this from the repository root.
import torch

from src.arguments import ModelArguments, DataArguments
from src.model.model import MMEBModel
from src.model.processor import load_processor, QWEN2_VL, VLM_VIDEO_TOKENS
from src.model.vlm_backbone.qwen2_vl.qwen_vl_utils import process_vision_info

model_args = ModelArguments(
    model_name='Qwen/Qwen2-VL-7B-Instruct',
    checkpoint_path='TIGER-Lab/VLM2Vec-Qwen2VL-7B',
    pooling='last',
    normalize=True,
    model_backbone='qwen2_vl',
    lora=True
)
data_args = DataArguments()

processor = load_processor(model_args, data_args)
model = MMEBModel.load(model_args)
model = model.to('cuda', dtype=torch.bfloat16)
model.eval()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "assets/example_video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Load the image and video inputs referenced in `messages`.
image_inputs, video_inputs = process_vision_info(messages)
# Encode the video as the query; the prompt starts with the backbone's video token.
inputs = processor(
    text=f'{VLM_VIDEO_TOKENS[QWEN2_VL]} Represent the given video.',
    videos=video_inputs,
    return_tensors="pt"
)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
# Add a leading dimension to the video tensors before the forward pass.
inputs['pixel_values_videos'] = inputs['pixel_values_videos'].unsqueeze(0)
inputs['video_grid_thw'] = inputs['video_grid_thw'].unsqueeze(0)
qry_output = model(qry=inputs)["qry_reps"]

# Encode a matching description as the target and score it against the video.
string = 'A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.'
inputs = processor(text=string,
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
# tensor([[0.4746]], device='cuda:0', dtype=torch.bfloat16)

# Encode an unrelated description; it receives a lower similarity score.
string = 'A person dressed in a blue jacket shovels the snow-covered pavement outside their house.'
inputs = processor(text=string,
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
# tensor([[0.3223]], device='cuda:0', dtype=torch.bfloat16)
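
The `ModelArguments` in the demo set `pooling='last'` and `normalize=True`. The sketch below shows what these options conventionally mean: take the hidden state of the last non-padding token as the embedding, then scale it to unit length, so that a dot product between two embeddings equals their cosine similarity. This is an assumption about the general technique, written in plain Python with hypothetical helper names (`last_token_pool`, `l2_normalize`) and right-padded toy inputs. It is not the `MMEBModel` implementation, which may differ in detail.

```python
import math
from typing import List, Sequence


def last_token_pool(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Return the hidden state at the last position whose mask value is 1."""
    positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not positions:
        raise ValueError("attention mask has no active positions")
    return list(hidden_states[positions[-1]])


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


# Toy sequence of four token states (dimension 2); the last position is padding.
hidden_states = [[0.1, 0.2], [0.5, 0.5], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden_states, attention_mask))
print(embedding)
```

With unit-length embeddings, a plain dot product is enough to compare a query with a target, which is one common reason to normalize before computing similarity.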

Citation

@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}

@article{meng2025vlm2vecv2,
  title={VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents},
  author={Meng, Rui and Jiang, Ziyan and Liu, Ye and Su, Mingyi and Yang, Xinyi and Fu, Yuepeng and Qin, Can and Chen, Zeyuan and Xu, Ran and Xiong, Caiming and Zhou, Yingbo and Chen, Wenhu and Yavuz, Semih},
  journal={arXiv preprint arXiv:2507.04590},
  year={2025}
}