UniME-LLaVA-1.6-7B: An Open-Source Multimodal Embedding Model Trained at 336×336 Resolution That Tops the MMEB Leaderboard

Developed by DeepGlint-AI

UniME is a universal embedding model built on a multimodal large language model (MLLM). Trained at 336×336 image resolution, it ranked first on the MMEB leaderboard.

Image-to-Text

Transformers

Language: English | License: MIT | Tags: Multimodal Embedding Learning, Cross-modal Retrieval, Knowledge Distillation

Release Date: 4/25/2025

Model Overview

UniME strengthens the embedding capability of multimodal large language models through textual discriminative knowledge distillation and hard negative enhanced instruction tuning, which makes the resulting model well suited to cross-modal retrieval tasks.

Model Features

Text-Discriminative Knowledge Distillation

Aligns student and teacher embeddings by minimizing the KL divergence between their batch-wise similarity distributions. Only the LLM component is fine-tuned; all other parameters stay frozen.

Hard Negative Mining

Filters false negatives with a similarity threshold to remove misleading samples, then automatically selects the top-k most similar non-matching samples as hard negatives to increase training difficulty.

High-Resolution Training

Trained at 336×336 image resolution to capture finer visual detail.

Model Capabilities

Cross-modal Retrieval

Image Understanding

Text Understanding

Embedding Learning

Use Cases

Cross-modal Retrieval

Image-Text Matching

Computes the similarity between images and text descriptions.

Ranked first on the MMEB leaderboard.

🚀 Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

UniME aims to break the modality barrier through universal embedding learning with multimodal LLMs. When trained at 336×336 image resolution, it achieves the top ranking on the MMEB leaderboard.

Tiancheng Gu*, Kaicheng Yang*, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

🏡 Project Page | 📄 Paper | 💻 Github

🚀 Quick Start

Installation

git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt

Usage Examples

Basic Usage

import torch
from PIL import Image
from torch.nn import functional as F
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

base_model_path = "DeepGlint-AI/UniME-LLaVA-1.6-7B"

# Prompts that ask the model to summarize the input in one word.
# The embedding is read from the last-layer hidden state at the final token position.
img_prompt = "[INST] <image>\nSummary above image in one word: [/INST]"
text_prompt = "[INST] <sent>\nSummary above sentence in one word: [/INST]"

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_texts = text_prompt.replace("<sent>", text)
input_image_prompt = img_prompt
input_image = [Image.open(image_path)]

# Load the processor and the model in half precision on the GPU.
transform = LlavaNextProcessor.from_pretrained(base_model_path)
model = LlavaNextForConditionalGeneration.from_pretrained(
    base_model_path,
    device_map="cuda",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Left padding keeps a real token at the final position of every sequence.
transform.tokenizer.padding_side = "left"
transform.tokenizer.padding = True

# Tokenize the text-only input and move its tensors to the GPU.
inputs_text = transform(text=input_texts,
                        images=None,
                        return_tensors="pt",
                        padding=True)
for key in inputs_text:
    inputs_text[key] = inputs_text[key].to("cuda")

# Preprocess the image together with its prompt.
inputs_image = transform(text=input_image_prompt,
                         images=input_image,
                         return_tensors="pt",
                         padding=True).to("cuda")

with torch.no_grad():
    # Take the last layer's hidden state at the last token as the embedding.
    emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    # L2-normalize so the dot product below is a cosine similarity.
    emb_text = F.normalize(emb_text, dim=-1)
    emb_image = F.normalize(emb_image, dim=-1)
    score = emb_image @ emb_text.T
print("Score: ", score)
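The example above scores one image against one sentence. For retrieval, the same cosine score is typically computed against many candidates and sorted. The following is a minimal, dependency-free sketch of that ranking step; the helper names `l2_normalize` and `rank_candidates` are hypothetical, and the toy 3-dimensional vectors merely stand in for real `emb_image` / `emb_text` outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_candidates(query_emb, candidate_embs, top_k=3):
    """Return (index, cosine similarity) pairs for the top_k closest candidates."""
    query = l2_normalize(query_emb)
    scores = []
    for idx, cand in enumerate(candidate_embs):
        cand = l2_normalize(cand)
        scores.append((idx, sum(q * c for q, c in zip(query, cand))))
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors standing in for real embeddings (assumption for illustration).
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.0, 0.0],  # closest caption
    [0.5, 0.5, 0.0],  # partially related caption
]
ranking = rank_candidates(image_emb, text_embs, top_k=2)
print(ranking)  # candidate 1 ranks first, candidate 2 second
```

With real embeddings, the same ranking is usually done as one matrix product on the GPU, as in `emb_image @ emb_text.T` above, followed by a top-k selection.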

✨ Features

💡 Highlights

To enhance the MLLM's embedding capability, we propose textual discriminative knowledge distillation. The training process involves decoupling the MLLM's LLM component and processing text with the prompt "Summarize the above sentences in one word.", followed by aligning the student (MLLM) and teacher (NV-Embed V2) embeddings via KL divergence on batch-wise similarity distributions. Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen.
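The distillation objective described above can be sketched as follows. This is a simplified illustration in plain Python, not the project's training code: the function names, the temperature value, the KL direction (teacher relative to student), and the inclusion of self-similarity on the diagonal are all assumptions. Because only batch-by-batch similarity matrices are compared, the student and teacher embeddings do not need the same dimensionality.

```python
import math


def log_softmax(row, temperature):
    """Numerically stable log-softmax of one similarity row."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def similarity_matrix(embs):
    """Cosine similarity between every pair of embeddings in the batch."""
    normed = []
    for vec in embs:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def distillation_kl(student_embs, teacher_embs, temperature=0.05):
    """Mean KL(teacher || student) over rows of the batch similarity matrices.

    temperature is a hypothetical value chosen for illustration.
    """
    s_sim = similarity_matrix(student_embs)
    t_sim = similarity_matrix(teacher_embs)
    total = 0.0
    for s_row, t_row in zip(s_sim, t_sim):
        log_p_t = log_softmax(t_row, temperature)
        log_p_s = log_softmax(s_row, temperature)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return total / len(s_sim)


# Toy batch of three sentences: 2-d student embeddings, 3-d teacher embeddings.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
loss = distillation_kl(student, teacher)
print(loss)
```

In actual training this would be written with tensor operations so that gradients flow into the student's LLM component while the teacher stays fixed.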

Next, we propose hard negative enhanced instruction tuning, which improves visual sensitivity, strengthens cross-modal alignment, and boosts instruction-following capability. At its core are two key innovations: a false negative filtering mechanism that uses a similarity threshold to eliminate misleading samples, and an automatic hard negative sampling strategy that selects the top-k similar but non-matching examples to increase training difficulty.
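The two mechanisms described above, false negative filtering and top-k hard negative sampling, can be sketched for a single query as below. This is an illustration under stated assumptions rather than the project's implementation: the function name `select_hard_negatives`, the way the threshold is formed (positive similarity plus a `margin`), and the default values are hypothetical.

```python
def select_hard_negatives(sim_row, positive_idx, top_k=2, margin=0.1):
    """Pick hard negatives for one query.

    sim_row      : similarities between the query and every candidate
    positive_idx : index of the matching (positive) candidate
    top_k        : number of hard negatives to keep
    margin       : hypothetical slack; a candidate scoring above
                   sim(positive) + margin is treated as a likely
                   false negative and discarded
    """
    threshold = sim_row[positive_idx] + margin
    kept = [
        (score, idx)
        for idx, score in enumerate(sim_row)
        if idx != positive_idx and score <= threshold
    ]
    # The most similar remaining non-matching candidates are the hardest negatives.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in kept[:top_k]]


# Candidate 0 is the positive; candidate 1 scores suspiciously higher than it.
sim_row = [0.80, 0.95, 0.78, 0.30, 0.75]
hard_negatives = select_hard_negatives(sim_row, positive_idx=0)
print(hard_negatives)  # [2, 4]
```

Candidate 1 is dropped as a probable false negative, and the easy candidate 3 is skipped in favor of the two closest remaining mismatches. The selected negatives would then feed a contrastive loss for the query.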

📚 Documentation

🔢 Results

[Figure: Diverse Retrieval results]

[Figure: MMEB results]

📄 License

This project is licensed under the MIT license.

📖 Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{gu2025breakingmodalitybarrieruniversal,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432}, 
}

📋 Information Table

Property	Details
Base Model	llava-hf/llava-v1.6-mistral-7b-hf
Training Data	TIGER-Lab/MMEB-train
Pipeline Tag	image-text-to-text
Library Name	transformers
Metrics	recall
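
The table lists recall as the evaluation metric. Assuming this refers to the usual Recall@k for retrieval (the fraction of queries whose correct candidate appears among the top k results), a minimal sketch is shown below. The function name `recall_at_k` and the toy similarity matrix are hypothetical and are not taken from the project's evaluation code.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose correct candidate is within the top-k results.

    similarity   : one row of candidate scores per query
    ground_truth : index of the correct candidate for each query
    """
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if ground_truth[query_idx] in order[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three queries over three candidates (illustrative values).
similarity = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]
ground_truth = [0, 1, 2]
print(recall_at_k(similarity, ground_truth, k=1))
print(recall_at_k(similarity, ground_truth, k=2))
```

Here the second query misses at k=1 because candidate 2 outranks the correct candidate 1, but it is recovered at k=2.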
