🚀 VLM2Vec-V2
VLM2Vec-V2 is a model for massive multimodal embedding tasks. By training a vision-language model, it provides stronger embeddings for multimodal data such as videos, images, and visual documents. It achieves strong results on the Massive Multimodal Embedding Benchmark (MMEB) and supports a wide range of applications.
Website | Github | 🏆 Leaderboard | 📖 MMEB-V2/VLM2Vec-V2 Paper | 📖 MMEB-V1/VLM2Vec-V1 Paper
🚀 Quick Start
🌟 News
- [2025.07] Released the technical report.
- [2025.05] Initial release of MMEB-V2/VLM2Vec-V2.
📊 Experimental Results
We report experimental results on MMEB-V2.
The detailed leaderboard is available here.
💻 Usage Examples
Basic Usage
We provide a demo example on Github.
```python
from src.arguments import ModelArguments, DataArguments
from src.model.model import MMEBModel
from src.model.processor import load_processor, QWEN2_VL, VLM_VIDEO_TOKENS
import torch
from src.model.vlm_backbone.qwen2_vl.qwen_vl_utils import process_vision_info

# Load the VLM2Vec checkpoint (a LoRA adapter on top of Qwen2-VL-7B-Instruct).
model_args = ModelArguments(
    model_name='Qwen/Qwen2-VL-7B-Instruct',
    checkpoint_path='TIGER-Lab/VLM2Vec-Qwen2VL-7B',
    pooling='last',
    normalize=True,
    model_backbone='qwen2_vl',
    lora=True,
)
data_args = DataArguments()

processor = load_processor(model_args, data_args)
model = MMEBModel.load(model_args)
model = model.to('cuda', dtype=torch.bfloat16)
model.eval()

# Build a video query: sample frames from the example clip at 1 fps.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "assets/example_video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=f'{VLM_VIDEO_TOKENS[QWEN2_VL]} Represent the given video.',
    videos=video_inputs,
    return_tensors="pt",
)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
# Add the batch dimension the model expects on the video tensors.
inputs['pixel_values_videos'] = inputs['pixel_values_videos'].unsqueeze(0)
inputs['video_grid_thw'] = inputs['video_grid_thw'].unsqueeze(0)
qry_output = model(qry=inputs)["qry_reps"]

# Encode a matching text candidate and score it against the video query.
string = 'A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.'
inputs = processor(text=string, images=None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))

# A mismatched caption should receive a lower similarity score.
string = 'A person dressed in a blue jacket shovels the snow-covered pavement outside their house.'
inputs = processor(text=string, images=None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
```
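Building on the example above, several candidates can also be scored in a loop and ranked by similarity. The sketch below reuses the `processor`, `model`, and `qry_output` objects defined above; the helper `rank_captions` is our own illustration rather than part of the repository, and it assumes `compute_similarity` returns a single-element tensor for one query/target pair.

```python
# Hypothetical helper (not part of the VLM2Vec repo): score a list of text
# candidates against one precomputed query embedding and rank them.
def rank_captions(qry_reps, captions):
    scored = []
    for caption in captions:
        inputs = processor(text=caption, images=None, return_tensors="pt")
        inputs = {key: value.to('cuda') for key, value in inputs.items()}
        tgt_reps = model(tgt=inputs)["tgt_reps"]
        # Assumes compute_similarity yields a single-element tensor here.
        score = model.compute_similarity(qry_reps, tgt_reps).item()
        scored.append((score, caption))
    return sorted(scored, reverse=True)  # highest similarity first

candidates = [
    'A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.',
    'A person dressed in a blue jacket shovels the snow-covered pavement outside their house.',
]
for score, caption in rank_captions(qry_output, candidates):
    print(f'{score:.4f}  {caption}')
```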
📚 Citation
If you use this project, please cite the following papers:
```bibtex
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
@article{meng2025vlm2vecv2,
  title={VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents},
  author={Meng, Rui and Jiang, Ziyan and Liu, Ye and Su, Mingyi and Yang, Xinyi and Fu, Yuepeng and Qin, Can and Chen, Zeyuan and Xu, Ran and Xiong, Caiming and Zhou, Yingbo and Chen, Wenhu and Yavuz, Semih},
  journal={arXiv preprint arXiv:2507.04590},
  year={2025}
}
```
📄 License
This project is released under the Apache-2.0 License.
📦 Datasets
- TIGER-Lab/MMEB-train
- TIGER-Lab/MMEB-V2
- TIGER-Lab/MMEB-eval
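These datasets are hosted on the Hugging Face Hub and can be pulled with the `datasets` library. A minimal sketch is shown below; the subset name `ImageNet-1K` and the `test` split are assumptions for illustration, so check each dataset card for the exact configuration names.

```python
from datasets import load_dataset

# The subset and split names below are assumptions for illustration; consult
# the dataset cards on the Hugging Face Hub for the exact configurations.
eval_subset = load_dataset("TIGER-Lab/MMEB-eval", "ImageNet-1K", split="test")
print(eval_subset)
```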
📋 Library Name
transformers