LLaVE-7B Open-Source Multimodal Embedding Model - Supports Embedding Representations of Text, Images, and Videos

LLaVE-7B

Developed by zhibinlan

LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B. It produces embeddings for text, single images, multiple images, and videos.

Multimodal Fusion

Transformers

Language: English
License: Apache-2.0
Tags: #Multimodal Embedding #Zero-shot Video Retrieval #Image-Text Contrastive Learning

Downloads 1,389

Release Time : 2/9/2025

Model Overview

LLaVE-7B is a multimodal embedding model that produces embeddings for text, images, multiple images, and videos. It ranks first on the MMEB leaderboard and transfers well to tasks it was not trained on, such as text-video retrieval.

Model Features

Multimodal Embedding Capability

Embeds text, images, multiple images, and videos with a single model

Outstanding Performance

Achieved state-of-the-art performance on MMEB with only 662,000 training samples

Strong Transfer Ability

Although trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner

Efficient Training

Reached this performance with a comparatively small training set (662K training pairs)

Model Capabilities

Text Embedding Representation

Image Embedding Representation

Multi-image Embedding Representation

Video Embedding Representation

Cross-modal Retrieval

Zero-shot Transfer Learning

Use Cases

Information Retrieval

Cross-modal Retrieval

Retrieve relevant images or videos based on text queries

Ranked first on the MMEB leaderboard

Content Understanding

Image Content Understanding

Embed image content so that it can be matched against relevant text descriptions

Can accurately distinguish different objects in images
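
As a rough illustration of the cross-modal retrieval use case above, the sketch below ranks a few candidates against a text query by cosine similarity. It does not call the model: the three-dimensional vectors, the file names, and the helper names `l2_normalize` and `rank_candidates` are all hypothetical stand-ins for real LLaVE-7B embeddings and your own code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    q = l2_normalize(query)
    scored = []
    for name, vec in candidates.items():
        c = l2_normalize(vec)
        scored.append((name, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real model embeddings (hypothetical values).
query_embedding = [0.9, 0.1, 0.0]
candidate_embeddings = {
    "cat_and_dog.jpg": [0.8, 0.2, 0.1],
    "city_street.mp4": [0.0, 0.3, 0.9],
    "tiger.jpg": [0.5, 0.5, 0.2],
}

ranking = rank_candidates(query_embedding, candidate_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

In practice the query and candidate vectors would come from the model, and images and videos could share one candidate pool because they are embedded into the same space.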

🚀 LLaVE-7B

LLaVE-7B is a 7B-parameter multimodal embedding model that can handle text, images, multiple images, and videos, and it ranks first on the MMEB leaderboard.

🚀 Quick Start

First, clone the GitHub repository and install the package:

git clone https://github.com/DeepLearnXMU/LLaVE
cd LLaVE
pip install -e ".[train]"

The example below shows a minimal embedding workflow with the model. For more details, refer to the GitHub repository.

# pip install git+https://github.com/DeepLearnXMU/LLaVE


import torch
import copy
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images

pretrained = "zhibinlan/LLaVE-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

# Image + Text -> Text
image = Image.open("figures/example.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models

question = DEFAULT_IMAGE_TOKEN + " Represent the given image with the following question: What is in the image"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], "\n")
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask=input_ids.ne(tokenizer.pad_token_id)
image_sizes = [image.size]
query_embed = model.encode_multimodal_embeddings(input_ids, attention_mask=attention_mask,images=image_tensor, image_sizes=image_sizes)

target_string = "A cat and a dog"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target_string)
conv.append_message(conv.roles[1], "\n")
target_string = conv.get_prompt()
target_input_ids = tokenizer(target_string, return_tensors="pt").input_ids.to(device)
attention_mask=target_input_ids.ne(tokenizer.pad_token_id)
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask)

print("A cat and a dog similarity score: ", query_embed @ target_embed.T)
# 7B: A cat and a dog similarity score: tensor([[0.6240]])

neg_string = "A cat and a tiger"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask=neg_input_ids.ne(tokenizer.pad_token_id)
neg_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)
print("A cat and a tiger similarity score: ", query_embed @ neg_embed.T)
# 7B: A cat and a tiger similarity score: tensor([[0.4543]])


# Text -> Image
pos_string = "Find me an everyday image that matches the given caption: A cat and a dog."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], pos_string)
conv.append_message(conv.roles[1], "\n")
pos_string = conv.get_prompt()
pos_input_ids = tokenizer(pos_string, return_tensors="pt").input_ids.to(device)
attention_mask=pos_input_ids.ne(tokenizer.pad_token_id)
pos_query_embed = model.encode_multimodal_embeddings(pos_input_ids, attention_mask=attention_mask)

target = DEFAULT_IMAGE_TOKEN + " Represent the given image."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target)
conv.append_message(conv.roles[1], "\n")
prompt_target = conv.get_prompt()
target_input_ids = tokenizer_image_token(prompt_target, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask=target_input_ids.ne(tokenizer.pad_token_id)
target_image_sizes = [image.size]
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask,images=image_tensor, image_sizes=target_image_sizes)

print("A cat and a dog image similarity score: ", pos_query_embed @ target_embed.T)
# 7B: A cat and a dog image similarity score: tensor([[0.5347]])

neg_string = "Find me an everyday image that matches the given caption: A cat and a tiger."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask=neg_input_ids.ne(tokenizer.pad_token_id)
neg_query_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)

print("A cat and a tiger image similarity score: ", neg_query_embed @ target_embed.T)
# 7B: A cat and a tiger image similarity score: tensor([[0.4001]])
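
The scores printed above only become useful once they are compared with each other. The sketch below takes the four values reported in the example comments and shows one possible way to pick the best match, measure its margin, and turn the scores into a distribution over candidates. It assumes the dot products behave like cosine similarities (the reported values are consistent with unit-normalized embeddings, but this page does not state it). The helper names and the softmax temperature of 0.05 are illustrative assumptions, not part of the LLaVE API.

```python
import math

# Scores copied from the comments in the quick-start example (7B model).
image_to_text = {"A cat and a dog": 0.6240, "A cat and a tiger": 0.4543}
text_to_image = {"A cat and a dog": 0.5347, "A cat and a tiger": 0.4001}

def best_match(scores):
    """Return the highest-scoring candidate and its margin over the runner-up."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_name, top_score), (_, second_score) = ordered[0], ordered[1]
    return top_name, top_score - second_score

def softmax(scores, temperature=0.05):
    """Turn similarities into a distribution over candidates.

    The temperature is an illustrative assumption, not a value from the model card.
    """
    scaled = [s / temperature for s in scores.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {name: e / total for name, e in zip(scores, exps)}

for label, scores in [("image -> text", image_to_text), ("text -> image", text_to_image)]:
    name, margin = best_match(scores)
    print(f"{label}: best = {name!r}, margin = {margin:.4f}")
    print(f"{label}: softmax = {softmax(scores)}")
```

In both directions the matching caption ("A cat and a dog") scores higher than the mismatched one, which is the behavior a retrieval system relies on.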

✨ Features

The LLaVE models are 7B-parameter multimodal embedding models based on the LLaVA-OneVision-7B model, with a context window of 4K tokens.
The model can embed text, images, multiple images, and videos.
It can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its potential for transfer to other embedding tasks.

📦 Installation

The installation steps are as follows:

git clone https://github.com/DeepLearnXMU/LLaVE
cd LLaVE
pip install -e ".[train]"

📚 Documentation

Model Summary

The LLaVE models are 7B-parameter multimodal embedding models based on the LLaVA-OneVision-7B model, with a context window of 4K tokens.

Repository: https://github.com/DeepLearnXMU/LLaVE
Paper: LLaVE (arXiv:2503.04812)

Train/Eval Data

Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval

Intended use

The model is intended for embedding text, images, multiple images, and videos.

MMEB Leaderboard

We achieved the top ranking on the MMEB leaderboard using only a small amount of data.

Model Performance

LLaVE-7B achieved state-of-the-art performance on MMEB using only 662K training pairs.

Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its potential for transfer to other embedding tasks.
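
Retrieval performance of this kind is commonly summarized with Recall@K: the fraction of queries whose correct item appears among the top K results. The sketch below computes it from a similarity matrix. It is a generic illustration, not the authors' evaluation code. The 4x4 matrix is made up, and the assumption that query i's correct video sits in column i is a convention chosen for this example.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for query_index, row in enumerate(similarity):
        # Column indices sorted from highest to lowest similarity for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if query_index in top_k:
            hits += 1
    return hits / len(similarity)

# Hypothetical text-to-video similarity matrix: rows are text queries,
# columns are videos, and the correct video for row i is column i.
similarity = [
    [0.62, 0.40, 0.31, 0.12],
    [0.35, 0.30, 0.58, 0.20],
    [0.28, 0.33, 0.51, 0.47],
    [0.15, 0.22, 0.44, 0.60],
]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(similarity, k):.2f}")
```

With real data, each row would hold the dot products between one text query embedding and all candidate video embeddings.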

Hardware & Software

Accelerators: 16 × Ascend 910B NPUs (64GB) (for whole model training)
Orchestration: Huggingface Trainer
Neural networks: PyTorch

🔧 Technical Details

LLaVE-7B is a 7B-parameter multimodal embedding model built on LLaVA-OneVision-7B, with a context window of 4K tokens. It can handle text, images, multiple images, and videos. The model reached state-of-the-art performance and the top ranking on the MMEB leaderboard using only 662K training pairs.

📄 License

This project is licensed under Apache-2.0.

📄 Citation

@article{lan2025llave,
  title={LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning},
  author={Lan, Zhibin and Niu, Liqiang and Meng, Fandong and Zhou, Jie and Su, Jinsong},
  journal={arXiv preprint arXiv:2503.04812},
  year={2025}
}
