GME-VARCO-VISION-Embedding
GME-VARCO-VISION-Embedding is a multimodal embedding model that computes semantic similarity between text, images, and videos in a high-dimensional embedding space, excelling in video retrieval tasks.
Quick Start
GME-VARCO-VISION-Embedding is a powerful multimodal embedding model that computes semantic similarity among text, images, and videos in a high-dimensional embedding space. It is particularly strong at video retrieval, which is more complex than image retrieval and requires a deeper understanding of context, achieving high retrieval accuracy and strong generalization across scenarios such as scene-based search, description-based search, and question-answering-based search.
Features
- Multimodal Embedding: Computes semantic similarity between text, images, and videos.
- Video Retrieval Focus: Specialized for video retrieval tasks with high accuracy and generalization.
- SOTA Performance: Achieves state-of-the-art zero-shot performance on the MultiVENT2.0 dataset as of July 2025.
Installation
No dedicated installation steps are provided. The usage examples below assume a recent PyTorch build together with the transformers and qwen_vl_utils packages (e.g. `pip install torch transformers qwen-vl-utils`), plus flash-attn if you want the flash_attention_2 implementation shown in the examples.
Usage Examples
Basic Usage
Image-Text Retrieval
```python
import torch
import torch.nn.functional as F  # needed for F.normalize below
import requests
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NCSOFT/GME-VARCO-VISION-Embedding"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
device = model.device

# Text query
qry_msg = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a photo of a cat."},
        ],
    },
]
qry_txt = processor.apply_chat_template(
    qry_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token
qry_input = processor(
    text=[qry_txt],
    padding=True,
    return_tensors="pt",
).to(device)
# Chat template for image candidates ("image" here is only a placeholder;
# the actual images are supplied to the processor below)
img_msg = [
    {
        "role": "user",
        "content": [{"type": "image", "image": "image"}],
    }
]
img_txt = processor.apply_chat_template(
    img_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

# Candidate images to rank against the query
candidate_imgs = [
    {
        "role": "user",
        "content": [{"type": "image", "image": "http://images.cocodataset.org/val2017/000000039769.jpg"}],
    },
    {
        "role": "user",
        "content": [{"type": "image", "image": "https://farm1.staticflickr.com/116/290755713_a5de6c1079_z.jpg"}],
    },
    {
        "role": "user",
        "content": [{"type": "image", "image": "http://farm3.staticflickr.com/2418/2193688811_d9f5e23bbd_z.jpg"}],
    },
    {
        "role": "user",
        "content": [{"type": "image", "image": "http://farm7.staticflickr.com/6049/6329686751_997c68fff9_z.jpg"}],
    },
]
candidate_images, _ = process_vision_info(candidate_imgs)
image_inputs = processor(
    text=[img_txt] * len(candidate_images),
    images=candidate_images,
    padding=True,
    return_tensors="pt",
).to(device)

# Use the last hidden state of the final token as the embedding
with torch.inference_mode():
    qry_emb = model(
        **qry_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]
    img_emb = model(
        **image_inputs, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

qry_emb = F.normalize(qry_emb, dim=-1)
img_emb = F.normalize(img_emb, dim=-1)
score = qry_emb @ img_emb.t()  # cosine similarities: 1 query x 4 candidates
```
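Here `score` is a 1 x 4 matrix of cosine similarities. A minimal, hypothetical follow-up (reusing the variables above) for picking the best-matching candidate:

```python
# Pick the candidate with the highest similarity to the query (illustrative follow-up)
best_idx = score.argmax(dim=-1).item()
print(f"Best match: candidate {best_idx}, similarity {score[0, best_idx].item():.4f}")
```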
Video Embedding
```python
# Hypothetical local video path; replace with your own file
video_path = "path/to/video.mp4"

vid_message = [
    {
        "role": "user",
        "content": [{
            "type": "video",
            "video": video_path,
            "max_pixels": 262144,
            "fps": 2.0,
        }],
    }
]
video_text = processor.apply_chat_template(
    vid_message, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token
image_input, video_input = process_vision_info(vid_message)
video_input = processor(
    text=[video_text],
    images=image_input,
    videos=video_input,
    padding=True,
    return_tensors="pt",
).to(device)

# Same last-token pooling as for the text and image embeddings
with torch.inference_mode():
    video_emb = model(
        **video_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]
video_emb = F.normalize(video_emb, dim=-1)
```
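Because text, image, and video embeddings share the same space, text-to-video retrieval is the same dot product as in the image example. A minimal sketch, assuming `qry_emb` was computed as in the image-text example above:

```python
# Cosine similarity between the text query embedding and the video embedding
# (assumes `qry_emb` from the image-text example above)
text_to_video_score = qry_emb @ video_emb.t()
print(text_to_video_score)
```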
Documentation
Demo Video
Check out our demo videos showcasing our multimodal embedding model in action. The demo demonstrates how our embedding model works together with an AI agent to search for relevant videos based on user queries and generate responses using the retrieved video content.
Model Architecture and Training Method
GME-VARCO-VISION-Embedding is based on Qwen/Qwen2-VL-7B-Instruct and uses the parameters of Alibaba-NLP/gme-Qwen2-VL-7B-Instruct to improve the model's general retrieval ability.
1. Fine-tuning (Contrastive Learning) on a video preference dataset
To fine-tune the model efficiently, we use ShareGPTVideo's 17k video preference dataset, which includes prompts, videos, gold answers, and chosen-rejected text pairs. We treat the prompts and videos as queries, and the rejected responses as hard negatives for the gold answers. Each query is trained against in-batch negatives plus one hard negative using the InfoNCE loss. The model is fully fine-tuned for two epochs on 8 A100 GPUs with a batch size of 8, requiring only a few hours of training.
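For reference, here is a minimal sketch of the objective described above (not the released training code); the temperature value and function name are assumptions:

```python
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(q, pos, hard_neg, temperature=0.05):
    """InfoNCE over in-batch negatives plus one hard negative per query.

    q, pos, hard_neg: (B, D) embeddings of queries (prompt + video), gold
    answers, and rejected (hard-negative) answers. The temperature is an
    assumed hyperparameter, not a published setting.
    """
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # Similarities against every gold answer in the batch (diagonal = positive,
    # off-diagonal = in-batch negatives) plus each query's own hard negative.
    in_batch = q @ pos.t()                            # (B, B)
    hard = (q * hard_neg).sum(dim=-1, keepdim=True)   # (B, 1)
    logits = torch.cat([in_batch, hard], dim=1) / temperature

    labels = torch.arange(q.size(0), device=q.device)  # positive sits at column i
    return F.cross_entropy(logits, labels)
```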
2. Adding a Retrieval Vector
To compensate for the limited number of training instances and to enhance the generalization ability of the fine-tuned model, we compute a retrieval vector by subtracting the weights of the original Qwen/Qwen2-VL-7B-Instruct model from those of Alibaba-NLP/gme-Qwen2-VL-7B-Instruct, a Qwen2-VL-based image-text retrieval model. This approach is inspired by Chat Vector, a method that equips pre-trained language models with chat capabilities in new languages by adding a vector obtained from the weight difference between a base model and its chat-optimized counterpart.
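A minimal sketch of this style of weight arithmetic, assuming all three checkpoints share identical parameter names and that the vector is added without scaling (neither detail is specified here); the fine-tuned checkpoint path is hypothetical:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Reference weights: the base model and the GME retrieval model built on it.
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
).state_dict()
gme = Qwen2VLForConditionalGeneration.from_pretrained(
    "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
).state_dict()

# Contrastively fine-tuned checkpoint (hypothetical path).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/finetuned-checkpoint", torch_dtype=torch.bfloat16
)

# Add the retrieval vector (gme - base) to the fine-tuned weights.
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in gme and name in base:
            param.add_(gme[name] - base[name])
```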
Performance
Our model achieves state-of-the-art (SOTA) zero-shot performance on the MultiVENT2.0 dataset as of July 2025. See the official leaderboard for detailed results.
Technical Details
The model is based on the architecture of Qwen/Qwen2-VL-7B-Instruct and relies on the fine-tuning and vector-addition methods described above to strengthen video retrieval. The fine-tuning on the video preference dataset and the added retrieval vector together account for its high accuracy and generalization ability.
License
The project is licensed under the CC BY-NC 4.0 license.