
LLM2CLIP-Openai-B-16

Developed by Microsoft
LLM2CLIP is an innovative method that leverages large language models (LLMs) to extend CLIP's capabilities, enhancing text discriminability through a contrastive learning framework and significantly improving cross-modal task performance.
Downloads: 1,154
Release Time: 11/7/2024

Model Overview

LLM2CLIP first fine-tunes an LLM in caption space with a contrastive objective, then freezes it and uses it as the teacher model for CLIP's visual encoder. This sidesteps the limitations of the original CLIP text encoder, allowing longer and more complex text inputs and significantly improving cross-modal task performance.
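
To make the training recipe concrete, here is a minimal PyTorch sketch of the two stages described above: caption-space contrastive fine-tuning of the LLM, then contrastive training of CLIP's visual encoder against the frozen LLM's text embeddings. This is an illustration of the idea, not the authors' code; the encoder callables, the adapter, and hyperparameters such as the temperature are all hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.05):
    # Symmetric contrastive (InfoNCE) loss: a[i] and b[i] form a positive
    # pair; every other combination in the batch acts as a negative.
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stage 1 (hypothetical): fine-tune the LLM in caption space so that two
# captions describing the same image embed close together.
def llm_caption_step(llm_encoder, captions_a, captions_b, optimizer):
    loss = info_nce(llm_encoder(captions_a), llm_encoder(captions_b))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (hypothetical): freeze the fine-tuned LLM and train CLIP's vision
# encoder (plus a small adapter/projection) against its text embeddings.
def clip_vision_step(vision_encoder, adapter, frozen_llm, images, captions,
                     optimizer):
    with torch.no_grad():
        text_emb = frozen_llm(captions)          # teacher embeddings
    image_emb = adapter(vision_encoder(images))  # student embeddings
    loss = info_nce(image_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric InfoNCE objective is the same one CLIP itself is trained with; the difference is that the text side now comes from the caption-fine-tuned LLM rather than CLIP's original text encoder.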

Model Features

LLM-enhanced Text Encoding
Fine-tuning LLMs under a contrastive learning framework significantly improves the discriminative power of text embeddings.
Long-text Support
Overcomes the original CLIP text encoder's 77-token length limit to support longer and more complex text inputs.
Cross-lingual Capabilities
Models trained only on English data demonstrate remarkable cross-lingual performance.
Multimodal Compatibility
Integrates seamlessly with multimodal models such as LLaVA, yielding across-the-board performance improvements.

Model Capabilities

Zero-shot image classification
Cross-modal retrieval
Long-text understanding
Multilingual support
Vision-language alignment
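
As a rough illustration of exercising the retrieval and zero-shot capabilities above, the sketch below follows the common Hugging Face pattern for this checkpoint (a CLIP image processor plus AutoModel with trust_remote_code=True exposing get_image_features); treat the recipe on the model card as authoritative. The text side normally comes from the paired LLM2CLIP text encoder; a random stand-in keeps the sketch self-contained.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# The checkpoint ships custom modeling code, hence trust_remote_code=True.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-Openai-B-16", trust_remote_code=True
).eval()

images = [Image.open(p) for p in ("cat.jpg", "dog.jpg")]  # example paths
pixels = processor(images=images, return_tensors="pt").pixel_values

with torch.no_grad():
    image_emb = F.normalize(model.get_image_features(pixels), dim=-1)

# Stand-in for the LLM-based text encoder that LLM2CLIP pairs with this
# image tower (see the model card for the real loading recipe).
text_emb = F.normalize(torch.randn(2, image_emb.shape[-1]), dim=-1)

# Cosine similarity matrix: rank images per caption (retrieval) or
# captions per image (zero-shot classification).
scores = image_emb @ text_emb.t()
```

The same similarity matrix serves both directions: ranking images for a given caption gives cross-modal retrieval, while ranking candidate captions for a given image gives zero-shot classification.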

Use Cases

Image Retrieval

Long-text Image Retrieval
Retrieving relevant images from complex, long-text descriptions.
Result: a 16.5% performance improvement over the EVA02 model.

Cross-lingual Image Retrieval
Retrieving images from text queries in other languages.
Result: models trained only on English data demonstrate exceptional cross-lingual capabilities.

Multimodal Applications

Integration with LLaVA 1.5
Combining with multimodal models to enhance visual understanding.
Result: outperforms the original CLIP on almost all benchmarks.