
UniME Phi3.5-V 4.2B

Developed by DeepGlint-AI
UniME is a general-purpose embedding model built on a multimodal large language model, designed to break down modality barriers and enable cross-modal retrieval and embedding learning.
Downloads: 54
Release Time: 4/25/2025

Model Overview

UniME employs text discriminative knowledge distillation and hard negative sample-enhanced instruction tuning to strengthen the embedding capabilities of multimodal large models, supporting cross-modal retrieval for both images and text.
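
To make the retrieval setting concrete, below is a minimal sketch of how UniME-style embeddings can be used for cross-modal search once image and text embeddings have been extracted from the model. The embedding dimension, function names, and toy data are illustrative assumptions, not the official inference code.

```python
import torch
import torch.nn.functional as F

def retrieve(query_embeds: torch.Tensor, candidate_embeds: torch.Tensor, top_k: int = 5):
    """Rank candidates for each query by cosine similarity.

    query_embeds:     (num_queries, dim), e.g. text embeddings from UniME
    candidate_embeds: (num_candidates, dim), e.g. image embeddings from UniME
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = F.normalize(query_embeds, dim=-1)
    c = F.normalize(candidate_embeds, dim=-1)
    scores = q @ c.T                          # (num_queries, num_candidates)
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    return top_scores, top_idx

if __name__ == "__main__":
    # Toy usage with random vectors standing in for real UniME embeddings;
    # the dimension 3072 is a hypothetical choice for illustration.
    text_embeds = torch.randn(4, 3072)
    image_embeds = torch.randn(100, 3072)
    scores, idx = retrieve(text_embeds, image_embeds, top_k=3)
    print(idx)  # indices of the 3 best-matching images per text query
```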

Model Features

Text Discriminative Knowledge Distillation
Aligns the embeddings of the student and teacher models over in-batch similarity distributions using KL divergence, fine-tuning only the language model component while keeping other parameters frozen (a minimal loss sketch follows after this list).
Hard Negative Sample-Enhanced Instruction Tuning
Uses a similarity-threshold-based false negative filtering mechanism and an automatic hard negative sampling strategy to improve visual sensitivity, strengthen cross-modal alignment, and enhance instruction-following capabilities (a mining sketch also follows after this list).
High-Resolution Image Processing
Supports training at 336×336 image resolution and delivers outstanding performance on multimodal embedding benchmarks.
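
As referenced in the distillation item above, here is a minimal sketch of a KL-divergence loss over in-batch similarity distributions, assuming student and teacher text embeddings for the same batch of captions are already available. The temperature value and pooling details are assumptions for illustration, not the released training recipe.

```python
import torch
import torch.nn.functional as F

def batch_similarity_kd_loss(student_embeds: torch.Tensor,
                             teacher_embeds: torch.Tensor,
                             temperature: float = 0.02) -> torch.Tensor:
    """KL-divergence distillation over in-batch similarity distributions.

    Both inputs: (batch, dim) text embeddings for the same batch of captions.
    The temperature here is an illustrative assumption.
    """
    s = F.normalize(student_embeds, dim=-1)
    t = F.normalize(teacher_embeds, dim=-1)
    # Pairwise cosine similarities within the batch.
    s_sim = s @ s.T / temperature
    t_sim = t @ t.T / temperature
    # The teacher defines the target distribution; the student matches it row-wise.
    log_p_student = F.log_softmax(s_sim, dim=-1)
    p_teacher = F.softmax(t_sim, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```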

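For the hard-negative item, the sketch below shows threshold-based false-negative filtering followed by hard-negative selection; the exact filtering rule, threshold handling, and negative count used by UniME may differ.

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(query_embeds: torch.Tensor,
                        candidate_embeds: torch.Tensor,
                        positive_idx: torch.Tensor,
                        num_negatives: int = 8) -> torch.Tensor:
    """Select hard negatives per query while filtering likely false negatives.

    query_embeds:     (batch, dim)
    candidate_embeds: (num_candidates, dim)
    positive_idx:     (batch,) index of each query's annotated positive
    """
    q = F.normalize(query_embeds, dim=-1)
    c = F.normalize(candidate_embeds, dim=-1)
    sim = q @ c.T                                        # (batch, num_candidates)
    pos_sim = sim.gather(1, positive_idx.unsqueeze(1))   # similarity of the true pair

    # Exclude the annotated positive itself from the negative pool.
    sim = sim.scatter(1, positive_idx.unsqueeze(1), float("-inf"))
    # Filter likely false negatives: candidates scoring at or above the true
    # positive (the exact threshold rule is an assumption for illustration).
    sim = sim.masked_fill(sim >= pos_sim, float("-inf"))

    # The remaining highest-scoring candidates are kept as hard negatives.
    _, hard_idx = sim.topk(num_negatives, dim=-1)
    return hard_idx
```
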
Model Capabilities

Image Embedding
Text Embedding
Cross-Modal Retrieval
Multimodal Alignment

Use Cases

Cross-Modal Retrieval
Image-to-Text Retrieval
Retrieve relevant text descriptions based on image content.
Ranked first on the MMEB leaderboard.
Text-to-Image Retrieval
Retrieve relevant images based on text descriptions.
Performs strongly across diverse retrieval tasks.