LLM2CLIP Llama 3 8B Instruct CC Finetuned

Developed by Microsoft
LLM2CLIP is an innovative approach that enhances CLIP's cross-modal capabilities through large language models, significantly improving the discriminative power of visual and text representations.
Downloads: 18.16k
Release Time: 11/16/2024

Model Overview

This method fine-tunes the LLM through contrastive learning, transferring its text capabilities into its output embedding layer so that it can serve as CLIP's text encoder. This breaks through the limitations of the original CLIP text encoder and supports longer, more complex descriptive text.
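
This checkpoint is the LLM side of that pipeline: a fine-tuned text encoder whose output embeddings feed CLIP. Below is a minimal sketch of wrapping it with the llm2vec library; the repository id, pooling settings, and dtype follow the commonly published LLM2CLIP usage pattern and are assumptions rather than an official recipe.

```python
# Minimal sketch (assumptions noted in comments): wrap the CC-finetuned Llama 3 8B
# checkpoint as a CLIP-style text encoder using the llm2vec library.
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
from llm2vec import LLM2Vec

llm_name = "microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned"  # assumed repo id

config = AutoConfig.from_pretrained(llm_name, trust_remote_code=True)
llm = AutoModel.from_pretrained(
    llm_name, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(llm_name)

# Mean-pool the LLM's hidden states into one embedding per caption; max_length=512
# matches the long-text limit stated in this card.
l2v = LLM2Vec(llm, tokenizer, pooling_mode="mean", max_length=512)

captions = [
    "A golden retriever leaps to catch a frisbee in a sunlit park.",
    "An aerial view of a container ship entering a busy harbor at dawn.",
]
text_features = l2v.encode(captions, convert_to_tensor=True)
print(text_features.shape)  # one embedding per caption
```

In the full LLM2CLIP pipeline these caption embeddings are then projected into the shared image-text space by the adapter shipped with the paired LLM2CLIP vision checkpoint; the sketch above stops at the raw LLM embeddings.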

Model Features

LLM-enhanced Text Representation
Improves text embedding quality by fine-tuning large language models, overcoming the text encoding limitations of the original CLIP
Long-text Support
Supports text input of up to 512 tokens, handling more complex descriptive content
Cross-lingual Capability
Achieves excellent cross-lingual retrieval performance with only English training data
Multimodal Compatibility
Integrates seamlessly with vision-language models such as LLaVA, consistently surpassing the performance of the original CLIP (see the matching sketch below)
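
Once captions and images are embedded in the same space, the matching behind these features reduces to a cosine-similarity comparison. The sketch below uses random stand-in tensors in place of real encoder outputs, and the embedding dimension (1280) is illustrative, not the model's actual width.

```python
# Sketch of CLIP-style matching once image and text embeddings share a space:
# L2-normalize both sides, then score every pair by cosine similarity.
import torch

image_features = torch.randn(4, 1280)  # stand-in for 4 encoded images
text_features = torch.randn(3, 1280)   # stand-in for 3 captions or class prompts

image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = image_features @ text_features.T        # (4, 3) cosine similarities
zero_shot_probs = (100.0 * similarity).softmax(-1)   # treat captions as class prompts
best_text_per_image = similarity.argmax(dim=-1)      # image-to-text retrieval
print(zero_shot_probs)
print(best_text_per_image)
```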

Model Capabilities

Image Feature Extraction
Cross-modal Retrieval
Zero-shot Classification
Multimodal Understanding
Long-text Processing

Use Cases

Image Retrieval
Complex Description Image Search
Search for relevant images using long natural-language descriptions (a retrieval sketch follows this section)
Performance on long-text retrieval tasks improved by 16.5%
Cross-lingual Applications
Non-English Image Retrieval
Retrieve relevant images using non-English queries
Lifts a model trained only on English data to state-of-the-art cross-lingual retrieval performance
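
For both retrieval use cases, a typical deployment embeds the image collection once and scores each incoming query, whether long, short, or non-English, against that index. The following is a hypothetical sketch with stand-in tensors; in practice the query embedding would come from the fine-tuned LLM text encoder and the index rows from the paired vision encoder.

```python
# Hypothetical text-to-image search over a precomputed image index.
import torch

def search(query_embedding: torch.Tensor, image_index: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the k images most similar to the query."""
    q = query_embedding / query_embedding.norm()
    scores = image_index @ q          # index rows are already unit-normalized
    return scores.topk(k).indices

image_index = torch.randn(1000, 1280)                        # stand-in for 1,000 indexed images
image_index = image_index / image_index.norm(dim=-1, keepdim=True)
query = torch.randn(1280)                                    # stand-in for a long or non-English query
print(search(query, image_index, k=5))
```

Because scoring is a single matrix-vector product over normalized embeddings, the same index serves complex long-description queries and non-English queries without any per-language processing.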