clip-flant5-xxl Open-source Vision-Language Model - Free Deployment to Facilitate Image-Text Retrieval

CLIP-FlanT5-XXL

Developed by zhiqiulin

A vision-language generative model fine-tuned from google/flan-t5-xxl and designed for image-text retrieval tasks.

Image-to-Text

Transformers

English

Open Source License: Apache-2.0

#Image-text retrieval #Vision-language generation #Multimodal fine-tuning

Downloads 86.23k

Release Time: 12/13/2023

Model Overview

This model was obtained by fine-tuning flan-t5-xxl for image-text retrieval tasks and was introduced in the VQAScore paper.

Model Features

Vision-language generation ability

Combines visual and language understanding to support cross-modal retrieval between images and text.

Fine-tuned based on Flan-T5

Fine-tuned from Flan-T5-XXL to add the ability to associate visual input with text while retaining the base model's language understanding.

Related to VQAScore

The model is associated with the VQAScore evaluation method and may be optimized for visual question-answering metrics.
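For context, the VQAScore paper scores an image-text pair as the probability that the model answers "Yes" to a yes/no question built from the text. The sketch below illustrates that idea only. The question template wording, the helper names `build_question` and `yes_probability`, and the logit values are assumptions for illustration. A real model would normalize over its full vocabulary rather than over two candidate answers.

```python
import math

# Assumed question template; the exact wording used with the released model may differ.
QUESTION_TEMPLATE = 'Does this figure show "{}"? Please answer yes or no.'


def build_question(caption: str) -> str:
    """Turn a caption into a yes/no question about the image."""
    return QUESTION_TEMPLATE.format(caption)


def yes_probability(answer_logits: dict) -> float:
    """Softmax over candidate answer logits, returning the probability of "Yes".

    Simplification: only the candidates passed in are normalized over,
    whereas a real model normalizes over its whole vocabulary.
    """
    peak = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


# Placeholder logits for illustration, not real model output.
toy_logits = {"Yes": 2.0, "No": 0.0}
print(build_question("a dog on a beach"))
print(yes_probability(toy_logits))
```

A higher "Yes" probability is read as a better match between the image and the text, which is what makes the score usable for ranking candidates.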

Model Capabilities

Image-text retrieval

Cross-modal understanding

Vision-language generation

Use Cases

Information retrieval

Image-based text retrieval

Retrieves relevant text descriptions based on image content

Cross-modal search

Supports bidirectional retrieval between images and text

Visual question-answering

VQA system

May be used to build a visual question-answering system (inferred from the model's association with VQAScore)
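The retrieval use cases above reduce to ranking candidates by an image-text score. As a minimal sketch, assume a matrix `scores[i][j]` already holds a match score for image `i` and text `j` (for example, one produced by a scoring model). The function names and the numbers below are hypothetical placeholders, not output from this model.

```python
def rank_texts_for_image(scores, image_idx):
    """Image-to-text retrieval: text indices sorted by descending score for one image."""
    row = scores[image_idx]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


def rank_images_for_text(scores, text_idx):
    """Text-to-image retrieval: image indices sorted by descending score for one text."""
    column = [row[text_idx] for row in scores]
    return sorted(range(len(column)), key=lambda i: column[i], reverse=True)


# Placeholder scores for 2 images (rows) and 3 texts (columns).
toy_scores = [
    [0.91, 0.12, 0.30],
    [0.08, 0.77, 0.25],
]

print(rank_texts_for_image(toy_scores, 0))  # best-matching texts for image 0
print(rank_images_for_text(toy_scores, 1))  # best-matching images for text 1
```

Reading along a row gives image-to-text retrieval and reading down a column gives text-to-image retrieval, so one score matrix serves both directions.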

Property	Details
Model Type	Vision-Language Generative Model
Training Data	Not provided in the original document
License	Apache-2.0
Fine-tuned from Model	google/flan-t5-xxl

