# Multimodal Image Captioning

Qwen2.5 VL 7B Captioner Relaxed GGUF
Apache-2.0
A GGUF build of Qwen2.5-VL-7B-Captioner-Relaxed, a vision-language model based on the Qwen2.5-VL architecture and focused on image-to-text generation.
Image-to-Text English
samgreen
320
1
Qwen2.5 VL 7B Captioner Relaxed
Apache-2.0
A multimodal large language model fine-tuned from Qwen2.5-VL-7B-Instruct and optimized for building text-to-image datasets. It produces more detailed image descriptions.
Image-to-Text Transformers English
Ertugrul
1,339
12
Qwen2.5 VL 3B Instruct MLX 8bits
An 8-bit quantized version of Qwen2.5-VL-3B-Instruct, optimized for the MLX framework, that supports image-to-text generation tasks.
Image-to-Text Transformers English
moot20
27
1
Qwen2 VL 7B Captioner Relaxed
Apache-2.0
An instruction-tuned model based on Qwen2-VL-7B-Instruct that generates more detailed image descriptions and is optimized for creating text-to-image datasets.
Image-to-Text Transformers English
Ertugrul
4,080
53
Blip
BSD-3-Clause
BLIP is a pretrained vision-language model for image captioning that generates natural-language descriptions of image content.
Image-to-Text Transformers
upro
19
2
Blip Image Captioning Large
BSD-3-Clause
BLIP is a unified vision-language pretraining framework for both image captioning and understanding tasks. It makes effective use of noisy web data by bootstrapping captions: a captioner generates synthetic captions and a filter removes the noisy ones.
Image-to-Text Transformers
movementso
18
0
Vinvl Base Image Captioning
Apache-2.0
Microsoft's VinVL base pretrained model, designed for image captioning and vision-language understanding.
Image-to-Text
michelecafagna26
45
1