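🚀 E5-V: Universal Embeddings with Multimodal Large Language Models
E5-V is a framework that adapts multimodal large language models (MLLMs) to produce multimodal embeddings. It bridges the modality gap between different input types and performs strongly on multimodal embedding tasks even without fine-tuning. Its single-modality training approach, which trains the model on text pairs only, outperforms multimodal training.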
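🚀 Quick Start
E5-V is fine-tuned from lmms-lab/llama3-llava-next-8b. The framework adapts MLLMs to produce multimodal embeddings, bridging the modality gap between input types, and it shows strong multimodal embedding performance even without fine-tuning. We also propose a single-modality training approach for E5-V in which the model is trained only on text pairs; this performs better than multimodal training.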
For more details, see https://github.com/kongds/E5-V
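To make "training on text pairs only" more concrete, here is a minimal sketch of a contrastive objective over paired sentence embeddings. It is an illustration, not the E5-V training code: the InfoNCE-style loss, the `temperature` value, and the helper names `cosine` and `info_nce_loss` are all assumptions introduced for this example, and the vectors are toy inputs rather than model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the match for anchors[i].

    Every other positive in the batch acts as a negative for anchor i.
    The temperature is an assumed value, not one taken from E5-V.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: each anchor is identical to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each sentence is closest to its own pair and large when the pairs are swapped, which is the signal a text-pair objective of this kind would optimize.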
💻 Usage Example
Basic Usage
import torch
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
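# Llama 3 chat template; {} is filled with the user message.
llama3_template = '<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n'
# Load the processor and the model in half precision on the GPU.
processor = LlavaNextProcessor.from_pretrained('royokong/e5-v')
model = LlavaNextForConditionalGeneration.from_pretrained('royokong/e5-v', torch_dtype=torch.float16).cuda()
# Prompts that ask the model to compress an image or a sentence into one word
# (wording kept verbatim from the original example).
img_prompt = llama3_template.format('<image>\nSummary above image in one word: ')
text_prompt = llama3_template.format('<sent>\nSummary above sentence in one word: ')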
urls = ['https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/American_Eskimo_Dog.jpg/360px-American_Eskimo_Dog.jpg',
        'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Felis_catus-cat_on_snow.jpg/179px-Felis_catus-cat_on_snow.jpg']
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
texts = ['A dog sitting in the grass.',
         'A cat standing in the snow.']
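# Pass text and images by keyword so the call does not depend on the
# positional argument order of the processor.
text_inputs = processor(text=[text_prompt.replace('<sent>', text) for text in texts], return_tensors="pt", padding=True).to('cuda')
img_inputs = processor(text=[img_prompt]*len(images), images=images, return_tensors="pt", padding=True).to('cuda')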
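with torch.no_grad():
    # Embedding = last layer's hidden state at the final token position.
    # This relies on left padding, so that position -1 is a real token in every sequence.
    text_embs = model(**text_inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    img_embs = model(**img_inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    # L2-normalize so that the dot product below is cosine similarity.
    text_embs = F.normalize(text_embs, dim=-1)
    img_embs = F.normalize(img_embs, dim=-1)
# Rows correspond to texts, columns to images.
print(text_embs @ img_embs.t())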
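The matrix printed above has one row per text and one column per image, so text-to-image retrieval amounts to sorting each row. The following is a minimal, dependency-free sketch of that step. The helper names `rank_candidates` and `recall_at_k` are hypothetical, and the scores in `example_scores` are made-up placeholders, not output from E5-V. It also assumes that text i is paired with image i, as in the example above.

```python
def rank_candidates(similarity, top_k=1):
    """For each row (query), return the indices of the top_k highest-scoring columns."""
    ranked = []
    for row in similarity:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:top_k])
    return ranked


def recall_at_k(similarity, top_k=1):
    """Fraction of queries whose paired candidate (same index) is in the top_k."""
    ranked = rank_candidates(similarity, top_k)
    hits = sum(1 for i, top in enumerate(ranked) if i in top)
    return hits / len(ranked)


# Placeholder scores for two texts (rows) against two images (columns).
example_scores = [
    [0.8, 0.3],
    [0.2, 0.7],
]
best_matches = rank_candidates(example_scores, top_k=1)
recall = recall_at_k(example_scores, top_k=1)
print(best_matches, recall)
```

To apply this to the real output, the similarity tensor could be converted to nested lists first, for example with `(text_embs @ img_embs.t()).tolist()`.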