LLM2CLIP-Openai-L-14-224 Open-source Model - Unleash the Potential of CLIP and Enhance Text Discrimination Ability

LLM2CLIP Openai L 14 224

Developed by Microsoft

LLM2CLIP is an innovative approach that leverages large language models (LLMs) to unlock the potential of CLIP. It enhances text discriminability through a contrastive learning framework, breaking the limitations of the original CLIP text encoder.

Text-to-Image

Safetensors

Open Source License: Apache-2.0 #Zero-shot classification #Cross-modal retrieval #Long text understanding

Release Time: 11/19/2024

Model Overview

LLM2CLIP fine-tunes the LLM in the caption space under a contrastive learning framework, extracting its textual capabilities into the output embeddings and significantly improving the textual discriminability of the output layer. The fine-tuned LLM then serves as a powerful teacher for the CLIP visual encoder in an efficient training process.
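To make the contrastive objective more concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss over caption embeddings. It is an illustration, not the authors' training code: the pairing scheme (row i of one batch is the positive for row i of the other), the temperature value, and the function names `l2_normalize` and `info_nce` are all assumptions made for this example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss where anchors[i] should match positives[i].

    The temperature is an arbitrary illustrative value (assumption).
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (made up): row i of each list stands for a caption of the same image.
captions_a = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
captions_b = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned_loss = info_nce(captions_a, captions_b)
shuffled_loss = info_nce(captions_a, captions_b[1:] + captions_b[:1])
print(aligned_loss, shuffled_loss)
```

The loss is low when matching pairs are more similar to each other than to the other items in the batch, and high when the pairing is shuffled. This is the property a contrastive objective relies on to sharpen the discriminability of the output embeddings.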

Model Features

Breaking the limitations of the CLIP text encoder

By introducing LLMs, longer and more complex captions can be used, breaking the context window and capability limitations of the original CLIP text encoder.

Cross-language capability

Transforms a CLIP model trained only on English data into a state-of-the-art cross-lingual model.

Performance improvement

Boosts the performance of the previous SOTA model, EVA02, by 16.5% on both long-text and short-text retrieval tasks.

Multimodal compatibility

When integrated into multimodal training with models such as Llava 1.5, it consistently outperforms CLIP across nearly all benchmarks.

Model Capabilities

Zero-shot classification

Cross-modal retrieval

Long text processing

Cross-language conversion

Use Cases

Image retrieval

Long-text image retrieval

Use longer and more complex captions for image retrieval

Performance improvement of 16.5%

Cross-language applications

Cross-language image retrieval

Apply a model trained on English to image retrieval in other languages

Yields a state-of-the-art cross-lingual model
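The retrieval use cases above come down to ranking a gallery of image embeddings against a text query embedding. The following is a small sketch under stated assumptions: the embeddings are made-up three-dimensional vectors, and `cosine` and `rank_images` are hypothetical helper names, not part of the LLM2CLIP code. In practice the vectors would come from the model's image and text feature outputs shown in the usage examples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(query_embedding, gallery, top_k=2):
    """Return the top_k (image_id, score) pairs, best match first."""
    scored = [(image_id, cosine(query_embedding, emb)) for image_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up embeddings standing in for precomputed image features.
gallery = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.1],
    "diagram.png": [0.0, 0.2, 0.9],
}
# Made-up embedding standing in for the encoded caption query.
query = [0.05, 0.1, 1.0]

results = rank_images(query, gallery)
for image_id, score in results:
    print(image_id, round(score, 3))
```

Because ranking depends only on the embeddings, the same routine applies whether the query caption is short, long, or written in another language.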

🚀 LLM2CLIP: Extending the Capability Boundaries of CLIP through Large Language Models

A novel approach that uses LLMs to unlock CLIP’s potential, improving cross-modal task performance.

🚀 Quick Start

In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP’s potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer’s textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP’s visual encoder. Thanks to the LLM’s presence, we can now incorporate longer and more complex captions without being restricted by vanilla CLIP text encoder’s context window and ability limitations. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks. Our method directly boosted the performance of the previously SOTA EVA02 model by 16.5% on both long-text and short-text retrieval tasks, transforming a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. Moreover, when integrated into multi-modal training with models like Llava 1.5, it consistently outperformed CLIP across nearly all benchmarks, demonstrating comprehensive performance improvements.

✨ Features

Enhanced Textual Discriminability: By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer’s textual discriminability.
Longer and More Complex Captions: Thanks to the LLM’s presence, we can incorporate longer and more complex captions without being restricted by vanilla CLIP text encoder’s context window and ability limitations.
Improved Cross-Modal Performance: Our approach brings substantial improvements in cross-modal tasks, boosting the performance of previous SOTA models and enabling cross-lingual capabilities.

📦 Installation

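The source provides no installation steps. The usage examples below import torch, transformers, PIL (Pillow), and llm2vec, so these packages must be available in your environment.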

💻 Usage Examples

Basic Usage

Image Embeddings

from PIL import Image
from transformers import AutoModel
from transformers import CLIPImageProcessor
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

image_path = "CLIP.png"
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-224" # or /path/to/local/LLM2CLIP-Openai-L-14

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = AutoModel.from_pretrained(
    model_name_or_path, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).to('cuda').eval()

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.autocast("cuda"):  # torch.cuda.amp.autocast() is deprecated in recent PyTorch
    outputs = model.get_image_features(input_pixels)

Retrieval

from PIL import Image
from transformers import AutoModel, AutoConfig, AutoTokenizer
from transformers import CLIPImageProcessor
import torch
from llm2vec import LLM2Vec
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-224" # or /path/to/local/LLM2CLIP-Openai-L-14
model = AutoModel.from_pretrained(
    model_name_or_path, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).to('cuda').eval()

# LLM text encoder fine-tuned for LLM2CLIP; it produces the caption embeddings
llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(
    llm_model_name, trust_remote_code=True
)
llm_model = AutoModel.from_pretrained(
    llm_model_name,
    torch_dtype=torch.bfloat16,
    config=config,
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'  # Workaround for LLM2Vec
# Mean-pool token embeddings into one vector per caption
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)

captions = ["a diagram", "a dog", "a cat"]
image_path = "CLIP.png"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')

with torch.no_grad(), torch.autocast("cuda"):  # torch.cuda.amp.autocast() is deprecated in recent PyTorch
    image_features = model.get_image_features(input_pixels)
    text_features = model.get_text_features(text_features)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
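The last lines of the retrieval example normalize both feature sets, take scaled dot products, and apply a softmax over the captions. The sketch below reproduces that arithmetic in plain Python on made-up vectors so the steps can be followed without a GPU or model download. The feature values and the helper names `normalize` and `label_probs` are assumptions for illustration only. The scale of 100.0 is taken from the example above.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def label_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each caption."""
    img = normalize(image_feature)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_features]
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

captions = ["a diagram", "a dog", "a cat"]
# Made-up features; real ones would come from get_image_features / get_text_features.
image_feature = [0.8, 0.1, 0.2]
text_features = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]]

probs = label_probs(image_feature, text_features)
best = captions[max(range(len(probs)), key=probs.__getitem__)]
print("Label probs:", probs)
print("Best caption:", best)
```

With a scale of 100.0, small gaps in cosine similarity become large gaps in probability, so the softmax output is usually close to one-hot.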

📚 Documentation

Model Details

Property	Details
Model Type	vision foundation model, feature backbone
Pretrain Dataset	CC3M, CC12M, YFCC15M and Recap-DataComp-1B (30M subset)

Important Note

It's important to note that all results presented in the paper are evaluated using PyTorch weights. There may be differences in performance when using Hugging Face (hf) models.
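One way to gauge how far two sets of weights diverge is to embed the same input with both and compare the resulting vectors. The sketch below is a generic check under stated assumptions, not a procedure from the paper: `compare_embeddings` is a hypothetical helper, and the two vectors are made-up stand-ins for embeddings produced by the PyTorch and Hugging Face weights.

```python
import math

def compare_embeddings(reference, candidate):
    """Return (cosine similarity, max absolute element-wise difference)."""
    dot = sum(a * b for a, b in zip(reference, candidate))
    norm_ref = math.sqrt(sum(a * a for a in reference))
    norm_cand = math.sqrt(sum(b * b for b in candidate))
    cosine = dot / (norm_ref * norm_cand)
    max_abs_diff = max(abs(a - b) for a, b in zip(reference, candidate))
    return cosine, max_abs_diff

# Made-up values standing in for the same image embedded by two checkpoints.
reference_embedding = [0.12, -0.45, 0.88]
candidate_embedding = [0.121, -0.449, 0.879]

cosine, max_abs_diff = compare_embeddings(reference_embedding, candidate_embedding)
print("cosine:", cosine, "max abs diff:", max_abs_diff)
```

A cosine similarity very close to 1.0 together with a small maximum difference suggests the two checkpoints behave alike on that input. Larger gaps are a signal to re-run the evaluation before comparing against the reported numbers.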

📄 License

The project is licensed under the Apache-2.0 license.

BibTeX & Citation

@misc{huang2024llm2clippowerfullanguagemodel,
      title={LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation}, 
      author={Weiquan Huang and Aoqi Wu and Yifan Yang and Xufang Luo and Yuqing Yang and Liang Hu and Qi Dai and Xiyang Dai and Dongdong Chen and Chong Luo and Lili Qiu},
      year={2024},
      eprint={2411.04997},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.04997}, 
}
