OmniEmbed-v0.1開源多模態嵌入模型 - 支持跨語言文本、音視頻統一表示

Home

Omniembed V0.1

Developed by Tevatron

基於Qwen2.5-Omni-7B構建的多模態嵌入模型，支持跨語言文本、圖像、音頻和視頻的統一嵌入表示

多模態融合

Safetensors

Open Source License:MIT #多模態嵌入 #跨模態檢索 #統一文檔檢索

Downloads 2,190

Release Time : 4/12/2025

Model Overview

OmniEmbed是一個多模態嵌入模型，能夠生成跨語言文本、圖像、音頻和視頻的統一嵌入表示，為多樣化應用提供高效的跨模態檢索能力。

Model Features

多模態統一嵌入

支持文本、圖像、音頻和視頻的統一嵌入表示，實現跨模態檢索

跨語言能力

支持多語言文本檢索，性能接近專業多語言檢索模型

高性能檢索

在多個基準測試中表現優異，與專業單模態模型相當

開源訓練

訓練數據和訓練代碼已在Tevatron完全開源

Model Capabilities

文本檢索

圖像文檔檢索

視頻檢索

音頻檢索

多語言檢索

Use Cases

多媒體檢索

視頻檢索

根據文本查詢檢索相關視頻內容

在MSRVTT數據集上R@1達到51.3，優於CLIP基線

音頻檢索

根據文本描述檢索相關音頻片段

在AudioCaps數據集上R@1達到34.0，優於現有基線

文檔檢索

圖像文檔檢索

從包含圖像/圖表的文檔中檢索相關信息

在VIDORE數據集上nDCG@5達到85.8

多語言檢索

跨語言文本檢索

在MIRACL數據集上nDCG@10達到69.1

🚀 Tevatron/OmniEmbed-v0.1

OmniEmbed 是一個強大的多模態嵌入模型，它基於 Qwen2.5-Omni-7B 構建，並使用了我們的 Tevatron 工具包。Tevatron 是一個跨規模、語言和模態的統一文檔檢索工具包。OmniEmbed 能夠為多語言文本、圖像、音頻和視頻生成統一的嵌入表示，從而實現有效的跨模態檢索，適用於各種不同的應用場景。

📝 文本 🖼️ 圖像 🎧 音頻 🎥 視頻 🌐 多語言

✨ 主要特性

基於強大的 Qwen2.5-Omni-7B 模型構建。
使用統一的 Tevatron 工具包，跨規模、語言和模態進行文檔檢索。
能夠為多語言文本、圖像、音頻和視頻生成統一的嵌入表示。
支持有效的跨模態檢索，適用於多種應用場景。

📦 安裝指南

文檔中未提及安裝步驟，故跳過此章節。

💻 使用示例

基礎用法

# Import Library, Load Model and Processor
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    'ArvinZhuang/OmniEmbed-test',
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
).to(device).eval()

processor.tokenizer.padding_side = "left"
model.padding_side = "left"

# Function to Encode Message
def encode_message(message):
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps

高級用法

🎬 視頻檢索

example_query = 'Query: How to cook Mapo Tofu?'
example_video_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"
example_video_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/zhajiang_noodle.mp4"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
video_1 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_1}]}]
video_2 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())

🎵 音頻檢索

example_query = 'Query: A light piano piece'
example_audio_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/joe_hisaishi_summer.mp3"
example_audio_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/jay_chou_superman_cant_fly.mp3"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
audio_1 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_1}]}]
audio_2 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(audio_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(audio_2))

print("Similarities:", sim1.item(), sim2.item())

📈 圖像文檔檢索（圖像、圖表、PDF）

example_query = 'Query: How many input modality does Qwen2.5-Omni support?'
example_image_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/qwen2.5omni_hgf.png"
example_image_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/llama4_hgf.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())

🌍 多語言文本檢索

example_query = 'Query: 氧氣在空氣中佔比多少？'
example_text_1 = "空氣是指大氣層中由不同氣體和各類飄浮在其中的固體與液體顆粒（大氣顆粒與氣溶膠）所組成的氣態混合物。地球大氣層的空氣主要由78.1%的氮氣、20.9%氧氣、0.9%的氬氣和1~4%的水蒸氣組成，其成分並不是固定的，隨著高度、氣壓、溫度的改變和對流情況不同，局部空氣的組成比例也會改變。空氣在大氣層（特別是對流層）中的流動形成了風和曳流、氣旋、龍捲等自然現象，而空氣中飄浮的顆粒則形成了雲、霧、霾和沙塵暴等短期天氣情況。空氣在海洋和陸地之間跨區域流動所承載的溼度和熱能傳導也是水循環和氣候變率與變化的關鍵一環。"
example_text_2 = "水（化學式：H2O）是一種無機化合物，在常溫且無雜質中是無色[1]無味不導電的透明液體，也會通過蒸發產生氣態的水蒸氣（這種蒸發可以發生在任何溫度下，同時取決於與空氣接觸的表面積和溼度差）。在標準大氣壓下，水的凝固點是0 °C（32 °F；273 K），沸點是100 °C（212 °F；373 K）。"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())

📚 詳細文檔

評估結果

基準測試	任務	指標	OmniEmbed	基線模型（分數）
BEIR - 13	文本檢索	nDCG@10	58.2	MistralE5（59.0）
MIRACL	多語言檢索	nDCG@10	69.1	BGE‑M3（69.2）
VIDORE	圖像文檔檢索	nDCG@5	85.8	DSE‑QWen2（85.8）
MSRVTT	視頻檢索	R@1	51.3	CLIP（31.2）
AudioCaps	音頻檢索	R@1	34.0	*[CE](https://paperswithcode.com/sota/text - to - audio - retrieval - on - audiocaps)（23.1）