clip-flant5-xxl

Developed by zhiqiulin
A vision-language generative model fine-tuned from google/flan-t5-xxl and designed for image-text retrieval tasks
Downloads 86.23k
Release Time: 12/13/2023

Model Overview

This model is a fine-tuned version of flan-t5-xxl for image-text retrieval tasks, introduced in the VQAScore paper.

Model Features

Vision-language generation
Combines visual and language understanding to support cross-modal retrieval between images and text
Fine-tuned from Flan-T5
Fine-tuned from Flan-T5-XXL to strengthen its ability to associate images with text while retaining the base model's language understanding
Related to VQAScore
The model's design is tied to the VQAScore evaluation method and may be optimized for visual question-answering metrics

Model Capabilities

Image-text retrieval
Cross-modal understanding
Vision-language generation

Use Cases

Information retrieval
Image-based text retrieval
Retrieve relevant text descriptions based on image content
Cross-modal search
Retrieve in both directions: text for a given image, and images for a given text
Visual question-answering
VQA system
May be used to build a visual question-answering system (inferred from the model's association with VQAScore)
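
The retrieval use cases above can be sketched as ranking over a matrix of image-text alignment scores. The following is a minimal illustration, not the model's actual API: `ScoreFn`, `score_matrix`, `rank_texts_for_image`, `rank_images_for_text`, and `toy_score` are hypothetical names introduced here, and `toy_score` is a word-overlap stand-in that would need to be replaced by a real scorer backed by the model.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (image, text) -> alignment score, higher is better.
# In practice this would wrap the vision-language model; that wiring is not shown.
ScoreFn = Callable[[str, str], float]


def score_matrix(images: Sequence[str], texts: Sequence[str],
                 score_fn: ScoreFn) -> List[List[float]]:
    """Score every (image, text) pair; rows are images, columns are texts."""
    return [[score_fn(img, txt) for txt in texts] for img in images]


def rank_texts_for_image(matrix: List[List[float]], image_idx: int) -> List[int]:
    """Image-to-text retrieval: text indices sorted by descending score."""
    row = matrix[image_idx]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


def rank_images_for_text(matrix: List[List[float]], text_idx: int) -> List[int]:
    """Text-to-image retrieval: image indices sorted by descending score."""
    col = [row[text_idx] for row in matrix]
    return sorted(range(len(col)), key=lambda i: col[i], reverse=True)


def toy_score(image: str, text: str) -> float:
    """Stand-in scorer (assumption): word overlap between a file name and a caption."""
    image_words = set(image.replace(".png", "").split("_"))
    text_words = set(text.lower().split())
    return len(image_words & text_words) / max(len(text_words), 1)


images = ["dog_on_beach.png", "red_car.png"]
texts = ["a dog on the beach", "a red car parked outside"]

matrix = score_matrix(images, texts, toy_score)
best_text_for_first_image = texts[rank_texts_for_image(matrix, 0)[0]]
best_image_for_second_text = images[rank_images_for_text(matrix, 1)[0]]
```

Scoring every pair once and then sorting rows or columns gives both retrieval directions from the same matrix, at the cost of one scorer call per image-text pair.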