🚀 VLM2Vec-V2
VLM2Vec-V2 is a model for massive multimodal embedding tasks. By training vision-language models, it produces stronger embeddings for multimodal data such as videos, images, and visual documents, and it achieves strong results on the Massive Multimodal Embedding Benchmark (MMEB), making it applicable to a wide range of retrieval and matching scenarios.
Website | Github | 🏆 Leaderboard | 📖 MMEB-V2/VLM2Vec-V2 Paper | 📖 MMEB-V1/VLM2Vec-V1 Paper
🚀 Quick Start
🌟 What's New
- [2025.07] Released the technical report.
- [2025.05] Initial release of MMEB-V2/VLM2Vec-V2.
📊 Experimental Results
We report experimental results on MMEB-V2.
The detailed leaderboard is available here.
💻 Usage Example
Basic Usage
We provide a demo example on Github.
```python
from src.arguments import ModelArguments, DataArguments
from src.model.model import MMEBModel
from src.model.processor import load_processor, QWEN2_VL, VLM_VIDEO_TOKENS
from src.model.vlm_backbone.qwen2_vl.qwen_vl_utils import process_vision_info
import torch

# Load the Qwen2-VL backbone with the VLM2Vec LoRA checkpoint.
model_args = ModelArguments(
    model_name='Qwen/Qwen2-VL-7B-Instruct',
    checkpoint_path='TIGER-Lab/VLM2Vec-Qwen2VL-7B',
    pooling='last',
    normalize=True,
    model_backbone='qwen2_vl',
    lora=True
)
data_args = DataArguments()

processor = load_processor(model_args, data_args)
model = MMEBModel.load(model_args)
model = model.to('cuda', dtype=torch.bfloat16)
model.eval()

# Video -> Text: embed a video query.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "assets/example_video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=f'{VLM_VIDEO_TOKENS[QWEN2_VL]} Represent the given video.',
    videos=video_inputs,
    return_tensors="pt"
)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
# Add the batch dimension expected by the model.
inputs['pixel_values_videos'] = inputs['pixel_values_videos'].unsqueeze(0)
inputs['video_grid_thw'] = inputs['video_grid_thw'].unsqueeze(0)
qry_output = model(qry=inputs)["qry_reps"]

# Embed a matching caption and score it against the video query.
string = 'A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.'
inputs = processor(text=string,
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))

# A mismatched caption should score lower.
string = 'A person dressed in a blue jacket shovels the snow-covered pavement outside their house.'
inputs = processor(text=string,
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
```
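Since `normalize=True`, the embeddings are L2-normalized, so `compute_similarity` effectively returns cosine similarity; the matching caption above should score noticeably higher than the mismatched one.

The same query/target pattern extends to image inputs. The snippet below is a minimal sketch rather than the official demo: it assumes `src.model.processor` also exports a `VLM_IMAGE_TOKENS` mapping analogous to `VLM_VIDEO_TOKENS`, and `assets/example_image.jpg` is a placeholder path.

```python
from PIL import Image
# Assumption: VLM_IMAGE_TOKENS is exported analogously to VLM_VIDEO_TOKENS.
from src.model.processor import VLM_IMAGE_TOKENS

# Image -> Text: embed an image query (placeholder path).
inputs = processor(
    text=f'{VLM_IMAGE_TOKENS[QWEN2_VL]} Represent the given image.',
    images=Image.open('assets/example_image.jpg'),
    return_tensors="pt"
)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
# Depending on the repo's collator, pixel_values / image_grid_thw may need an
# extra batch dimension here, as with the video tensors above.
qry_output = model(qry=inputs)["qry_reps"]

# Score a candidate caption exactly as in the video example.
string = 'A dog catching a toy in the snow.'
inputs = processor(text=string, images=None, return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
```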
📚 Citation
If you use this project, please cite the following papers:
```bibtex
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}

@article{meng2025vlm2vecv2,
  title={VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents},
  author={Rui Meng and Ziyan Jiang and Ye Liu and Mingyi Su and Xinyi Yang and Yuepeng Fu and Can Qin and Zeyuan Chen and Ran Xu and Caiming Xiong and Yingbo Zhou and Wenhu Chen and Semih Yavuz},
  journal={arXiv preprint arXiv:2507.04590},
  year={2025}
}
```
📄 License
This project is licensed under the Apache-2.0 License.
📦 Datasets
- TIGER-Lab/MMEB-train
- TIGER-Lab/MMEB-V2
- TIGER-Lab/MMEB-eval
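All three datasets are hosted on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal sketch follows; the subset name "ImageNet-1K" is an assumption for illustration, so check each dataset card for the actual task names:

```python
from datasets import load_dataset

# "ImageNet-1K" is an assumed subset name for illustration;
# see the TIGER-Lab/MMEB-eval dataset card for the real task list.
ds = load_dataset("TIGER-Lab/MMEB-eval", "ImageNet-1K", split="test")
print(ds[0])
```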