
LLM2CLIP-Openai-L-14-224

Developed by Microsoft
LLM2CLIP is an innovative approach that leverages large language models (LLMs) to unlock the potential of CLIP. It enhances text discriminability through a contrastive learning framework, breaking the limitations of the original CLIP text encoder.
Downloads: 108
Release date: 11/19/2024

Model Overview

LLM2CLIP fine-tunes the LLM in the caption space under a contrastive learning framework, distilling its textual capabilities into the output embeddings and significantly enhancing the text discriminability of the output layer. An efficient training process then uses the fine-tuned LLM as a powerful teacher model for the CLIP visual encoder.
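The contrastive objective described above pairs each image with its caption and pushes matched pairs together while pushing mismatched pairs apart. A minimal sketch of this kind of symmetric CLIP-style InfoNCE loss (a simplified illustration, not the actual LLM2CLIP training code) is:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings,
    as used in CLIP-style contrastive training (toy sketch)."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))                # matching pairs lie on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Correctly paired batches should score a lower loss than shuffled ones; in the real method the text embeddings would come from the fine-tuned LLM rather than random vectors.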

Model Features

Breaking the limitations of the CLIP text encoder
By introducing LLMs, longer and more complex captions can be used, overcoming the context-window and capability limitations of the original CLIP text encoder.
Cross-language capability
Transforms a CLIP model trained only on English data into a state-of-the-art cross-lingual model.
Performance improvement
Improves on the previous SOTA model EVA02 by 16.5% on long-text and short-text retrieval tasks.
Multimodal compatibility
When combined with multimodal models such as LLaVA 1.5, it consistently outperforms CLIP on almost all benchmarks.

Model Capabilities

Zero-shot classification
Cross-modal retrieval
Long text processing
Cross-lingual transfer
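Zero-shot classification with a CLIP-style model reduces to comparing an image embedding against one text embedding per class name and picking the most similar. A toy sketch with precomputed embeddings (the real pipeline would obtain them from the LLM2CLIP encoders) looks like:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding has the highest cosine
    similarity to the image embedding -- the standard CLIP-style
    zero-shot recipe (toy sketch with precomputed embeddings)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # one cosine similarity per class
    return int(np.argmax(sims)), sims
```

Because the comparison happens purely in embedding space, the same recipe supports cross-modal retrieval: swap the roles of the query and the candidate set.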

Use Cases

Image retrieval
Long-text image retrieval
Use longer and more complex captions for image retrieval
Performance improvement of 16.5%
Cross-language applications
Cross-language image retrieval
Apply a model trained on English to image retrieval in other languages
Achieves state-of-the-art cross-lingual performance
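The retrieval use cases above boil down to ranking a gallery of image embeddings by similarity to a (possibly long and complex) caption embedding. A minimal ranking sketch, again with toy precomputed embeddings rather than the actual LLM2CLIP encoders, is:

```python
import numpy as np

def retrieve(text_emb, image_embs, top_k=3):
    """Return the indices of the top_k images ranked by cosine
    similarity to a caption embedding (toy retrieval sketch)."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = image_embs @ text_emb        # one score per gallery image
    return np.argsort(-sims)[:top_k]    # best-scoring images first
```

Stronger, more discriminative text embeddings reorder this ranking in favor of the correct images, which is where the reported 16.5% retrieval gain comes from.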