GME-VARCO-VISION-Embedding
GME-VARCO-VISION-Embedding is a multimodal embedding model that computes semantic similarity between text, images, and videos in a high-dimensional embedding space, excelling in video retrieval tasks.
Quick Start
GME-VARCO-VISION-Embedding is a powerful multimodal embedding model that computes semantic similarity among text, images, and videos in a high-dimensional embedding space. It is particularly strong at video retrieval, which is more complex than image retrieval and requires a deeper understanding of context, achieving high retrieval accuracy and strong generalization across scenarios such as scene-based search, description-based search, and question-answering-based search.
Features
- Multimodal Embedding: Computes semantic similarity between text, images, and videos.
- Video Retrieval Focus: Specialized for video retrieval tasks with high accuracy and generalization.
- SOTA Performance: Achieves state-of-the-art zero-shot performance on the MultiVENT2.0 dataset as of July 2025.
Installation
No dedicated installation steps are provided. The usage examples below assume a recent PyTorch build together with the transformers and qwen_vl_utils packages (e.g. `pip install torch transformers qwen-vl-utils`), plus flash-attn if you want the flash_attention_2 implementation shown in the examples.
Usage Examples
Basic Usage
Image-Text Retrieval
```python
import torch
import torch.nn.functional as F  # needed for F.normalize below
import requests
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NCSOFT/GME-VARCO-VISION-Embedding"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
device = model.device

# Text query
qry_msg = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a photo of a cat."},
        ],
    },
]
qry_txt = processor.apply_chat_template(
    qry_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token
qry_input = processor(
    text=[qry_txt],
    padding=True,
    return_tensors="pt",
).to(device)
# Chat template for image candidates ("image" here is only a placeholder;
# the actual images are supplied to the processor below)
img_msg = [
    {
        "role": "user",
        "content": [{"type": "image", "image": "image"}],
    }
]
img_txt = processor.apply_chat_template(
    img_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

# Candidate images to rank against the query
candidate_imgs = [
    {
        "role": "user",
        "content": [{"type": "image", "image": "http://images.cocodataset.org/val2017/000000039769.jpg"}],
    },
    {
        "role": "user",
        "content": [{"type": "image", "image": "https://farm1.staticflickr.com/116/290755713_a5de6c1079_z.jpg"}],
    },
    {
        "role": "user",
        "content": [{"type": "image", "image": "http://farm3.staticflickr.com/2418/2193688811_d9f5e23bbd_z.jpg"}],
    },
    {
        "role": "user",
        "content": [{"type": "image", "image": "http://farm7.staticflickr.com/6049/6329686751_997c68fff9_z.jpg"}],
    },
]
candidate_images, _ = process_vision_info(candidate_imgs)
image_inputs = processor(
    text=[img_txt] * len(candidate_images),
    images=candidate_images,
    padding=True,
    return_tensors="pt",
).to(device)

# Use the last hidden state of the final token as the embedding
with torch.inference_mode():
    qry_emb = model(
        **qry_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]
    img_emb = model(
        **image_inputs, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

qry_emb = F.normalize(qry_emb, dim=-1)
img_emb = F.normalize(img_emb, dim=-1)
score = qry_emb @ img_emb.t()  # cosine similarities: 1 query x 4 candidates
```
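Here `score` is a 1 x 4 matrix of cosine similarities. A minimal, hypothetical follow-up (reusing the variables above) for picking the best-matching candidate:

```python
# Pick the candidate with the highest similarity to the query (illustrative follow-up)
best_idx = score.argmax(dim=-1).item()
print(f"Best match: candidate {best_idx}, similarity {score[0, best_idx].item():.4f}")
```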
Video Embedding
```python
# Hypothetical local video path; replace with your own file
video_path = "path/to/video.mp4"

vid_message = [
    {
        "role": "user",
        "content": [{
            "type": "video",
            "video": video_path,
            "max_pixels": 262144,
            "fps": 2.0,
        }],
    }
]
video_text = processor.apply_chat_template(
    vid_message, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token
image_input, video_input = process_vision_info(vid_message)
video_input = processor(
    text=[video_text],
    images=image_input,
    videos=video_input,
    padding=True,
    return_tensors="pt",
).to(device)

# Same last-token pooling as for the text and image embeddings
with torch.inference_mode():
    video_emb = model(
        **video_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]
video_emb = F.normalize(video_emb, dim=-1)
```
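Because text, image, and video embeddings share the same space, text-to-video retrieval is the same dot product as in the image example. A minimal sketch, assuming `qry_emb` was computed as in the image-text example above:

```python
# Cosine similarity between the text query embedding and the video embedding
# (assumes `qry_emb` from the image-text example above)
text_to_video_score = qry_emb @ video_emb.t()
print(text_to_video_score)
```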
Documentation
Demo Video
Check out our demo videos showcasing our multimodal embedding model in action. The demo demonstrates how our embedding model works together with an AI agent to search for relevant videos based on user queries and generate responses using the retrieved video content.
Model Architecture and Training Method
GME-VARCO-VISION-Embedding is based on Qwen/Qwen2-VL-7B-Instruct and uses the parameters of Alibaba-NLP/gme-Qwen2-VL-7B-Instruct to improve the model's general retrieval ability.
1. Fine-tuning (Contrastive Learning) on a video preference dataset
To fine-tune the model efficiently, we use ShareGPTVideo's 17k video preference dataset, which includes prompts, videos, gold answers, and chosen-rejected text pairs. We treat the prompts and videos as queries, and the rejected responses as hard negatives for the gold answers. Each query is trained against in-batch negatives plus one hard negative using the InfoNCE loss. The model is fully fine-tuned for two epochs on 8 A100 GPUs with a batch size of 8, requiring only a few hours of training.
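For reference, here is a minimal sketch of the objective described above (not the released training code); the temperature value and function name are assumptions:

```python
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(q, pos, hard_neg, temperature=0.05):
    """InfoNCE over in-batch negatives plus one hard negative per query.

    q, pos, hard_neg: (B, D) embeddings of queries (prompt + video), gold
    answers, and rejected (hard-negative) answers. The temperature is an
    assumed hyperparameter, not a published setting.
    """
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # Similarities against every gold answer in the batch (diagonal = positive,
    # off-diagonal = in-batch negatives) plus each query's own hard negative.
    in_batch = q @ pos.t()                            # (B, B)
    hard = (q * hard_neg).sum(dim=-1, keepdim=True)   # (B, 1)
    logits = torch.cat([in_batch, hard], dim=1) / temperature

    labels = torch.arange(q.size(0), device=q.device)  # positive sits at column i
    return F.cross_entropy(logits, labels)
```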
2. Adding a Retrieval Vector
To compensate for the limited number of training instances and to enhance the generalization ability of the fine-tuned model, we compute a retrieval vector by subtracting the weights of the original Qwen/Qwen2-VL-7B-Instruct model from those of Alibaba-NLP/gme-Qwen2-VL-7B-Instruct, a Qwen2-VL-based image-text retrieval model. This approach is inspired by Chat Vector, a method that equips pre-trained language models with chat capabilities in new languages by adding a vector obtained from the weight difference between a base model and its chat-optimized counterpart.
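A minimal sketch of this style of weight arithmetic, assuming all three checkpoints share identical parameter names and that the vector is added without scaling (neither detail is specified here); the fine-tuned checkpoint path is hypothetical:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Reference weights: the base model and the GME retrieval model built on it.
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
).state_dict()
gme = Qwen2VLForConditionalGeneration.from_pretrained(
    "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
).state_dict()

# Contrastively fine-tuned checkpoint (hypothetical path).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/finetuned-checkpoint", torch_dtype=torch.bfloat16
)

# Add the retrieval vector (gme - base) to the fine-tuned weights.
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in gme and name in base:
            param.add_(gme[name] - base[name])
```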
Performance
Our model achieves state-of-the-art (SOTA) zero-shot performance on the MultiVENT2.0 dataset as of July 2025. See the official leaderboard for detailed results.
Technical Details
The model is based on the architecture of Qwen/Qwen2-VL-7B-Instruct and relies on the fine-tuning and vector-addition methods described above to strengthen video retrieval. The fine-tuning on the video preference dataset and the added retrieval vector together account for its high accuracy and generalization ability.
License
The project is licensed under the CC BY-NC 4.0 license.