# Vision-language generation
Clip Flant5 Xxl
Apache-2.0
A vision-language generation model fine-tuned from google/flan-t5-xxl and designed for image-text retrieval tasks.
Image-to-Text
Transformers English

zhiqiulin
86.23k
2
Blip2 Opt 6.7b Coco
MIT
BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering tasks.
Image-to-Text
Transformers English

Salesforce
216.79k
33
Featured Recommended AI Models