CLIP-FlanT5-XXL
A vision-language generation model fine-tuned from google/flan-t5-xxl and designed for image-text retrieval tasks
Downloads 86.23k
Release Time: 12/13/2023
Model Overview
This model is a fine-tuned version of flan-t5-xxl for image-text retrieval tasks. It was introduced in the VQAScore paper.
Model Features
Vision-language generation ability
Combines visual and language understanding to support cross-modal retrieval between images and text
Fine-tuned based on Flan-T5
Built by targeted fine-tuning of Flan-T5-XXL, which strengthens the link between visual and textual content while retaining the base model's language understanding
Related to VQAScore
The model was presented together with the VQAScore evaluation method and may be optimized for metrics related to visual question answering
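The link to VQAScore can be made more concrete with a small sketch. VQAScore, as generally described, scores an image-text pair by asking the model a yes/no question about the text and reading off the probability of a "Yes" answer. The code below illustrates only that idea: the question wording, the function names, and the toy logits are assumptions for illustration, and nothing here calls the actual model.

```python
import math
from typing import Dict

# Assumed question wording for a VQAScore-style prompt; the exact template
# used with this model should be checked against the original paper or code.
QUESTION_TEMPLATE = 'Does this figure show "{}"? Please answer yes or no.'


def build_question(caption: str) -> str:
    """Turn a caption into a yes/no question about the image."""
    return QUESTION_TEMPLATE.format(caption)


def yes_probability(logits: Dict[str, float], yes_token: str = "Yes") -> float:
    """Apply a softmax over answer-token logits and return P(yes_token)."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    return exps[yes_token] / sum(exps.values())


# Hypothetical logits standing in for the model's first answer token.
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
score = yes_probability(toy_logits)
```

A higher score means the model considers the caption a better match for the image, which is what makes the same quantity usable for ranking in retrieval.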
Model Capabilities
Image-text retrieval
Cross-modal understanding
Vision-language generation
Use Cases
Information retrieval
Image-based text retrieval
Retrieve relevant text descriptions based on image content
Cross-modal search
Retrieve in both directions: find images that match a text, or text that matches an image
Visual question-answering
VQA system
Could serve as the basis for a visual question-answering system (inferred from the model's association with VQAScore)
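The retrieval use cases above come down to scoring every image-text pair and sorting by score. The following is a minimal sketch of that ranking step, assuming a scoring function that takes an image path and a text and returns a number where higher means a better match. The names `rank_texts_for_image`, `rank_images_for_text`, and `toy_score` are hypothetical; `toy_score` is a word-overlap stand-in, not the model, and would be replaced by a wrapper around real model inference.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (image_path, text) to a match score; higher is better.
Scorer = Callable[[str, str], float]


def rank_texts_for_image(
    image_path: str, texts: Sequence[str], score_fn: Scorer
) -> List[Tuple[str, float]]:
    """Image-to-text retrieval: sort candidate texts by score, best first."""
    scored = [(text, score_fn(image_path, text)) for text in texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rank_images_for_text(
    text: str, image_paths: Sequence[str], score_fn: Scorer
) -> List[Tuple[str, float]]:
    """Text-to-image retrieval: sort candidate images by score, best first."""
    scored = [(path, score_fn(path, text)) for path in image_paths]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(image_path: str, text: str) -> float:
    """Hypothetical stand-in scorer: word overlap between file name and text."""
    name_words = set(image_path.lower().replace(".", "_").split("_"))
    text_words = set(text.lower().split())
    return len(name_words & text_words) / max(len(text_words), 1)


best_text = rank_texts_for_image(
    "red_car.png", ["a red car", "a blue boat"], toy_score
)[0][0]
best_image = rank_images_for_text(
    "a blue boat", ["red_car.png", "blue_boat.png"], toy_score
)[0][0]
```

Passing the scorer in as a parameter keeps the ranking logic independent of how scores are produced, so the same two functions cover both retrieval directions.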