UniME-LLaVA-OneVision-7B Open-Source Multimodal Model - A Practical Choice for Enhancing Multimodal Embedding Capabilities

Developed by DeepGlint-AI

UniME is a general embedding learning framework based on multimodal large models, significantly enhancing multimodal embedding capabilities through text discriminative knowledge distillation and hard negative sample-enhanced instruction tuning strategies.

Multimodal Alignment

Transformers

Language: English | Open Source License: MIT | Tags: Multimodal Embedding Learning, Text Discriminative Distillation, Hard Negative Sample Enhancement

Downloads 376

Release Time: 5/6/2025

Model Overview

UniME aims to break through modal barriers and enhance the embedding capabilities of multimodal large models through innovative training methods, achieving excellent performance on the MMEB leaderboard.

Model Features

Text Discriminative Knowledge Distillation

The LLM component is decoupled from the multimodal model and processes text with a prompt. The embedding vectors of the student model are then aligned with those of the teacher model using KL divergence. Only the LLM component is fine-tuned.

Hard Negative Sample Enhancement

Adopts a false negative sample filtering mechanism based on similarity thresholds and an automatic selection strategy for top-k similar but mismatched samples to increase training difficulty and improve model performance.

Multimodal Embedding Optimization

Optimizes the multimodal system by enhancing visual sensitivity, strengthening cross-modal alignment, and improving instruction-following capabilities.

Model Capabilities

Multimodal Embedding Learning

Image Text Understanding

Cross-modal Retrieval

Text Summarization

Use Cases

Information Retrieval

Cross-modal Retrieval

Retrieve relevant text descriptions based on images, or retrieve relevant images based on text

Performs excellently in MMEB evaluations

Content Understanding

Image Content Summarization

Summarize image content with concise words

🚀 Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

UniME breaks the modality barrier with embedding learning techniques built on multimodal LLMs, and it ranks first on the MMEB leaderboard.

👀 Project Page | 📄 Paper | 💻 Github

UniME achieves the top ranking on the MMEB leaderboard when trained at a 336×336 image resolution. (The screenshot was captured at 08:00 UTC+8 on May 6, 2025.)

✨ Features

To enhance the MLLM's embedding capability, we propose textual discriminative knowledge distillation. Training decouples the LLM component from the MLLM and feeds it text with the prompt "Summarize the above sentences in one word.", then aligns the student (MLLM) and teacher (NV-Embed V2) embeddings by minimizing the KL divergence between their batch-wise similarity distributions. Only the LLM component is fine-tuned during this stage; all other parameters remain frozen.
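The alignment objective described above can be illustrated with a small, framework-free sketch. It is not the authors' training code. It assumes cosine similarity within the batch, a softmax with a hypothetical `temperature` parameter, and the direction KL(teacher || student); the function names are illustrative only. Because the loss compares batch-by-batch similarity matrices rather than raw vectors, the student and teacher embeddings may have different dimensions.

```python
import math


def softmax(values, temperature):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_rows(embeddings):
    """Cosine similarity of every embedding with every embedding in the batch."""
    unit = [l2_normalize(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """Mean KL(teacher || student) over rows of the two similarity matrices.

    The temperature value and the KL direction are assumptions made for
    this sketch; the source text does not specify them.
    """
    student_rows = similarity_rows(student_emb)
    teacher_rows = similarity_rows(teacher_emb)
    total = 0.0
    for s_row, t_row in zip(student_rows, teacher_rows):
        p = softmax(t_row, temperature)  # teacher distribution (target)
        q = softmax(s_row, temperature)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_rows)


# Toy batch: the teacher uses 3-d vectors, the student 2-d vectors.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
student_aligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
student_shuffled = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

print("aligned loss :", distillation_loss(student_aligned, teacher))
print("shuffled loss:", distillation_loss(student_shuffled, teacher))
```

The loss is zero when the student reproduces the teacher's pairwise similarity structure and grows as the two structures diverge, which is the signal that drives the LLM component toward more discriminative text embeddings.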

Next, we propose hard negative enhanced instruction tuning, which improves visual sensitivity, strengthens cross-modal alignment, and boosts instruction-following capability. It rests on two mechanisms: a false negative filtering step that uses a similarity threshold to eliminate misleading samples, and an automatic hard negative sampling strategy that selects the top-k most similar but non-matching examples to increase training difficulty.
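The two mechanisms can be sketched for a single query as follows. This is an illustration under stated assumptions, not the project's implementation: the threshold is taken to be the positive pair's similarity plus a hypothetical `margin`, and the function name, `k`, and the toy scores are all made up for the example.

```python
def select_hard_negatives(query_sims, positive_index, k=2, margin=0.1):
    """Pick hard negatives for one query from in-batch candidates.

    query_sims: similarity of the query to every candidate in the batch.
    positive_index: index of the candidate that truly matches the query.
    margin: hypothetical offset; any non-matching candidate scoring above
        sim(positive) + margin is treated as a likely false negative.
    """
    threshold = query_sims[positive_index] + margin

    # Step 1: false negative filtering. Drop the positive itself and any
    # candidate whose similarity exceeds the threshold.
    candidates = [
        (sim, idx)
        for idx, sim in enumerate(query_sims)
        if idx != positive_index and sim <= threshold
    ]

    # Step 2: hard negative sampling. Keep the k most similar survivors.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in candidates[:k]]


# Candidate 0 is the positive. Candidate 1 scores suspiciously high and is
# filtered out; candidates 2 and 4 are the hardest remaining negatives.
sims = [0.80, 0.95, 0.78, 0.30, 0.75, 0.10]
print(select_hard_negatives(sims, positive_index=0, k=2))
```

Filtering before ranking matters: without it, the top-k selection would favor exactly those candidates most likely to be unlabeled matches, and the model would be penalized for correct behavior.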

🚀 Quick Start

git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt
pip install transformers==4.49.0

💻 Usage Examples

Basic Usage

import torch
from PIL import Image
from torch.nn import functional as F
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

def apply_chat_template(image=None, text=None):
    # Build a single-turn conversation that asks the model to compress
    # either an image or a sentence into one word.
    if image is not None:
        conversation = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Summary above image in one word:\n"},
            ],
        }]
    elif text is not None:
        conversation = [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{text}\nSummary above sentence in one word:\n"},
            ],
        }]
    else:
        raise ValueError("Provide either an image or a text input.")
    return conversation

base_model_path = "DeepGlint-AI/UniME-LLaVA-OneVision-7B"

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_image = [Image.open(image_path)]

transform = AutoProcessor.from_pretrained(base_model_path, trust_remote_code=True)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    base_model_path, device_map="cuda", trust_remote_code=True, torch_dtype=torch.float16
)
# Pad on the left so that position -1 is always the last real token.
transform.tokenizer.padding_side = "left"
transform.tokenizer.padding = True

inputs_text = transform.apply_chat_template([apply_chat_template(text=text)],
                                        add_generation_prompt=True,
                                        tokenize=True,
                                        return_dict=True,
                                        return_tensors="pt",
                                        padding=True).to("cuda")
inputs_image = transform.apply_chat_template([apply_chat_template(image=input_image)],
                                        add_generation_prompt=True,
                                        tokenize=True,
                                        return_dict=True,
                                        return_tensors="pt",
                                        padding=True).to("cuda")

with torch.no_grad():
    # Use the final-layer hidden state of the last token as the embedding.
    emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    # L2-normalize so that the dot product equals cosine similarity.
    emb_text = F.normalize(emb_text, dim=-1)
    emb_image = F.normalize(emb_image, dim=-1)
    score = emb_image @ emb_text.T
print("Score: ", score.item())
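The example above scores one image against one sentence. For retrieval, the same score is computed against many candidates and the candidates are sorted. The sketch below shows that ranking step on plain Python lists so that it runs without a GPU or the model. The helper names and the toy vectors are hypothetical; in practice the embeddings would come from the model as shown above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_emb, candidate_embs, top_k=3):
    """Return (candidate_index, score) pairs, most similar first."""
    scored = [(cosine(query_emb, emb), idx) for idx, emb in enumerate(candidate_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Toy stand-ins for one image embedding and three caption embeddings.
image_embedding = [1.0, 0.0]
caption_embeddings = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]

for idx, score in rank_candidates(image_embedding, caption_embeddings):
    print(f"caption {idx}: {score:.3f}")
```

Swapping the roles of query and candidates gives text-to-image retrieval with the same function.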

📚 Documentation

Diverse Retrieval

MMEB

📄 License

This project is licensed under the MIT license.

📋 Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{gu2025breakingmodalitybarrieruniversal,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432}, 
}

Property	Details
Model Type	llava-hf/llava-onevision-qwen2-7b-ov-hf
Training Data	TIGER-Lab/MMEB-train
Pipeline Tag	image-text-to-text
Library Name	transformers
Metrics	recall
