🚀 LLM2CLIP: Extending the Capability Boundaries of CLIP through Large Language Models
This paper presents LLM2CLIP, a novel approach leveraging the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, it enhances the textual discriminability of the output embeddings. An efficient training process is then designed in which the fine-tuned LLM serves as a teacher for CLIP's visual encoder. This approach significantly improves performance on cross-modal tasks, boosts existing CLIP models, and enables cross-lingual capabilities.
🚀 Quick Start
In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process in which the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by the context window and capability limitations of the vanilla CLIP text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks. Our method directly boosted the performance of the previously SOTA EVA02 model by 16.5% on both long-text and short-text retrieval tasks, transforming a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. Moreover, when integrated into multimodal training with models like LLaVA 1.5, it consistently outperformed CLIP across nearly all benchmarks, demonstrating comprehensive performance improvements.
Weiquan Huang¹*, Aoqi Wu¹*, Yifan Yang²†, Xufang Luo², Yuqing Yang², Liang Hu¹, Qi Dai², Xiyang Dai², Dongdong Chen², Chong Luo², Lili Qiu²
¹Tongji University, ²Microsoft Corporation
*Equal contribution
† Corresponding to: yifanyang@microsoft.com
[📂 GitHub] [🆕 Blog] [📜 LLM2CLIP]
✨ Features
- Enhanced Textual Discriminability: Fine-tuning the LLM in the caption space with contrastive learning significantly improves the textual discriminability of its output embeddings.
- Efficient Training Process: The fine-tuned LLM serves as a powerful teacher for CLIP's visual encoder, enabling the use of longer and more complex captions (see the illustrative sketch after this list).
- Improved Cross-Modal Performance: Substantial gains on cross-modal tasks, including boosting the retrieval performance of the EVA02 model and enabling cross-lingual capabilities.
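To make the training idea concrete, below is a minimal, illustrative sketch (not the repository's actual training code) of a CLIP-style alignment stage in which the caption-contrastively fine-tuned LLM serves as a frozen teacher: its caption embeddings are the targets, and the visual encoder plus small learnable adapters are optimized with a symmetric contrastive loss. Names such as vision_encoder, visual_adapter, and llm_text_encoder are placeholders; consult the paper and the GitHub repository for the exact recipe.

import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    # Symmetric InfoNCE over a batch of matched image-caption pairs.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# One training step (sketch): the fine-tuned LLM stays frozen as the teacher;
# only the visual encoder and the lightweight adapters receive gradients.
# image_feats = visual_adapter(vision_encoder(images))       # trainable
# with torch.no_grad():
#     caption_feats = llm_text_encoder(captions)              # frozen LLM teacher
# loss = clip_style_loss(image_feats, text_adapter(caption_feats))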
📦 Installation
No dedicated installation steps are provided in the original README. The usage examples below assume a CUDA-capable GPU and that torch, transformers, pillow, and llm2vec are installed (inferred from the imports in the snippets, not an official requirements list).
💻 Usage Examples
Basic Usage
Image Embeddings
from PIL import Image
from transformers import AutoModel
from transformers import CLIPImageProcessor
import torch

image_path = "CLIP.png"
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336"  # Hugging Face ID or local checkpoint path

# Image preprocessing follows the original OpenAI CLIP ViT-L/14-336 processor.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = model.get_image_features(input_pixels)
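get_image_features returns a batch of image embeddings (here a single row). If you intend to compare them with text embeddings, L2-normalize them first, as the retrieval example below does; a minimal illustration (not part of the original snippet):

# Illustrative follow-up: L2-normalize so dot products become cosine similarities.
image_embeds = outputs / outputs.norm(dim=-1, keepdim=True)
print(image_embeds.shape)  # (1, D); D depends on the checkpoint's projection size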
Retrieval
from PIL import Image
from transformers import AutoModel, AutoConfig, AutoTokenizer
from transformers import CLIPImageProcessor
import torch
from llm2vec import LLM2Vec
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Vision side: LLM2CLIP's visual encoder with the original CLIP preprocessing.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).to('cuda').eval()

# Text side: the caption-contrastive fine-tuned Llama-3-8B-Instruct wrapped with LLM2Vec.
llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(llm_model_name, trust_remote_code=True)
llm_model = AutoModel.from_pretrained(
    llm_model_name,
    torch_dtype=torch.bfloat16,
    config=config,
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'  # workaround so LLM2Vec recognizes the base Llama-3 architecture
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)

captions = ["a diagram", "a dog", "a cat"]
image_path = "CLIP.png"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')  # LLM caption embeddings

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.get_image_features(input_pixels)
    text_features = model.get_text_features(text_features)  # project the LLM embeddings into the CLIP space

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
📚 Documentation
LLM2CLIP performance
**Note: all results reported in the paper were evaluated with the PyTorch weights; performance may differ when using the Hugging Face (hf) checkpoints.**
Model Details
| Property | Details |
|----------|---------|
| Model Type | Vision foundation model, feature backbone |
| Training Data | CC3M, CC12M, YFCC15M, and Recap-DataComp-1B (30M subset) |
📄 License
This project is licensed under the Apache-2.0 license.
BibTeX & Citation
@misc{huang2024llm2clippowerfullanguagemodel,
title={LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation},
author={Weiquan Huang and Aoqi Wu and Yifan Yang and Xufang Luo and Yuqing Yang and Liang Hu and Qi Dai and Xiyang Dai and Dongdong Chen and Chong Luo and Lili Qiu},
year={2024},
eprint={2411.04997},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.04997},
}