Open-source UniME-Phi3.5-V-4.2B model - Breaking the modality barrier for cross-modal retrieval and embedding learning

Unime Phi3.5 V 4.2B

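Developed by DeepGlint-AI

UniME is a universal embedding model built on a multimodal large language model. It aims to break the modality barrier, enabling cross-modal retrieval and embedding learning.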

Multimodal alignment

Transformers

Language: English | License: MIT | #multimodal-embedding #text-image-retrieval #knowledge-distillation

Downloads: 54

Release date: 4/25/2025

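Model overview

UniME strengthens the embedding capability of multimodal large language models through textual discriminative knowledge distillation and hard-negative-enhanced instruction tuning, and it supports cross-modal retrieval between images and text.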

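Model features

Textual discriminative knowledge distillation

Aligns the student and teacher embeddings by minimizing the KL divergence between their in-batch similarity distributions. Only the language model component is fine-tuned; all other parameters stay frozen.

Hard-negative-enhanced instruction tuning

Combines a false-negative filter based on a similarity threshold with an automatic hard-negative sampling strategy to improve visual sensitivity, strengthen cross-modal alignment, and improve instruction following.

High-resolution image processing

Supports training at 336×336 image resolution and performs strongly on multimodal embedding benchmarks.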

Model capabilities

Image embedding

Text embedding

Cross-modal retrieval

Multimodal alignment

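Use cases

Cross-modal retrieval

Image-to-text retrieval

Retrieves relevant text descriptions based on image content.

Ranks first on the MMEB leaderboard.

Text-to-image retrieval

Retrieves relevant images based on a text description.

Performs strongly across diverse retrieval tasks.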

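🚀 Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

The UniME project aims to break down the barriers between modalities by using multimodal large language models for universal embedding learning. It achieves strong results on the MMEB leaderboard and offers a new approach for building multimodal systems.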

Project information

Property	Details
License	MIT
Dataset	TIGER-Lab/MMEB-train
Base model	microsoft/Phi-3.5-vision-instruct
Library	transformers
Tags	retrieval, multimodal, embedding
Task type	image-text-to-text

Authors

Tiancheng Gu*, Kaicheng Yang*, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

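Project links

🏡 Project page | 📄 Paper | 💻 GitHub

Results

UniME ranks at the top of the MMEB leaderboard when trained at 336×336 image resolution. (Leaderboard screenshot taken on May 6, 2025, 08:00 UTC+8.)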

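✨ Main features

Textual discriminative knowledge distillation

To strengthen the embedding capability of multimodal large language models (MLLMs), we propose textual discriminative knowledge distillation. Training decouples the large language model (LLM) component from the MLLM and feeds it text-only inputs with the prompt "Summary above sentence in one word:". The embeddings of the student (the MLLM) and the teacher (NV-Embed V2) are then aligned by minimizing the KL divergence between their in-batch similarity distributions. Only the LLM component is fine-tuned in this stage; all other parameters remain frozen.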
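The distillation objective can be illustrated with a minimal sketch. It is written in plain Python rather than PyTorch so that it runs anywhere, and it is not the authors' training code: the function names, the temperature value, the averaging over rows, and the KL direction (teacher as the reference distribution) are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distributions(embeddings, temperature):
    """Row i is a distribution over how similar sample i is to each batch sample."""
    unit = [_normalize(e) for e in embeddings]
    sims = [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]
    return [_softmax(row, temperature) for row in sims]


def distillation_kl(student_emb, teacher_emb, temperature=0.05):
    """Mean KL(teacher || student) over in-batch similarity distributions.

    temperature=0.05 is a hypothetical value, not taken from the source.
    """
    p_teacher = similarity_distributions(teacher_emb, temperature)
    p_student = similarity_distributions(student_emb, temperature)
    total = 0.0
    for t_row, s_row in zip(p_teacher, p_student):
        total += sum(t * math.log(t / s) for t, s in zip(t_row, s_row) if t > 0)
    return total / len(p_teacher)


# Toy batch of three sentences; the embeddings are made up.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
student = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
loss = distillation_kl(student, teacher)
```

Because only the similarity distributions are compared, the student and teacher embeddings do not need to share a dimension, as the toy batch shows.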

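Hard-negative-enhanced instruction tuning

Next, we propose hard-negative-enhanced instruction tuning, which improves the model by raising its visual sensitivity, strengthening cross-modal alignment, and improving instruction following. It rests on two components: (1) a false-negative filter that uses a similarity threshold to discard misleading samples, and (2) an automatic hard-negative sampling strategy that selects the top-k examples that are similar but do not match, which makes training harder.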
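The two components can be sketched for a single query as follows. This is an illustrative sketch rather than the project's implementation: the function name is hypothetical, and it assumes the threshold is the positive's similarity plus a margin (the `margin` value is also hypothetical), with any candidate scoring above that threshold treated as a likely false negative.

```python
def select_hard_negatives(query_sims, positive_index, k, margin=0.1):
    """Pick up to k hard negatives for one query.

    query_sims[i] is the similarity between the query and candidate i.
    Assumption: candidates scoring above (positive similarity + margin)
    are treated as false negatives and dropped before sampling.
    """
    threshold = query_sims[positive_index] + margin
    kept = [
        (sim, idx)
        for idx, sim in enumerate(query_sims)
        if idx != positive_index and sim <= threshold
    ]
    # The hardest negatives are the remaining candidates most similar to the query.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in kept[:k]]


# Candidate 0 is the labeled positive; candidate 1 scores far above it,
# so it is filtered out as a probable false negative.
sims = [0.62, 0.95, 0.60, 0.10, 0.55, 0.70]
hard_negatives = select_hard_negatives(sims, positive_index=0, k=2)
```

Filtering before sampling matters: without it, the top-k step would pick exactly the unlabeled matches that the filter is meant to remove.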

🚀 Quick start

Clone the repository and create the environment

git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt

Usage example

import torch
from PIL import Image
from torch.nn import functional as F
from transformers import AutoProcessor, AutoModelForCausalLM

base_model_path = "DeepGlint-AI/UniME-Phi3.5-V-4.2B"
# Prompts that ask the model to summarize the input in one word.
img_prompt = '<|user|>\n<|image_1|>\nSummary above image in one word: <|end|>\n<|assistant|>\n'
text_prompt = '<|user|>\n<sent>\nSummary above sentence in one word: <|end|>\n<|assistant|>\n'

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_texts = text_prompt.replace('<sent>', text)
input_image_prompt = img_prompt
input_image = [Image.open(image_path)]

transform = AutoProcessor.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    _attn_implementation='flash_attention_2',
)
# Left padding keeps the last position aligned with the final real token.
transform.tokenizer.padding_side = "left"
transform.tokenizer.padding = True

inputs_text = transform(text=input_texts,
                    images=None,
                    return_tensors="pt", 
                    padding=True)
for key in inputs_text: inputs_text[key] = inputs_text[key].to("cuda")
inputs_image = transform(text=input_image_prompt,
                    images=input_image, 
                    return_tensors="pt", 
                    padding=True).to("cuda")

with torch.no_grad():
  # Use the final-layer hidden state of the last token as the embedding.
  emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
  emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
  # L2-normalize so that the dot product below is the cosine similarity.
  emb_text = F.normalize(emb_text, dim=-1)
  emb_image = F.normalize(emb_image, dim=-1)
  score = emb_image @ emb_text.T
print("Score: ", score)
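For retrieval, the same cosine score is computed against many candidates and the candidates are sorted by it. The sketch below shows that ranking step in plain Python on made-up vectors; the function names are hypothetical, and in practice the vectors would be the model embeddings produced as in the example above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_cosine(query, candidates):
    """Return (candidate_index, cosine_score) pairs, best match first."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored]


# Toy 2-D vectors standing in for one query embedding and three candidates.
ranking = rank_by_cosine([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
```

The same function covers both directions: pass an image embedding as the query for image-to-text retrieval, or a text embedding for text-to-image retrieval.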

📚 Detailed documentation

Diverse retrieval results

MMEB results

📄 License

This project is released under the MIT License.

📖 Citation

If you find this repository useful, please cite it using the following BibTeX entry.

@misc{gu2025breakingmodalitybarrieruniversal,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432}, 
}