# Image-to-Text Generation

## Dimple 7B
rp-yu · Apache-2.0 · 422 downloads · 3 likes
Dimple is the first discrete diffusion multimodal large language model (DMLLM), combining autoregressive and diffusion training paradigms. Trained on the same dataset as LLaVA-NeXT, it outperforms LLaVA-NeXT-7B by 3.9%.
Tags: Image-to-Text, Transformers, English

## Magma 8B GGUF
Mungert · MIT · 545 downloads · 1 like
Magma-8B is an image-text-to-text model distributed in GGUF format, suitable for multimodal task processing.
Tags: Image-to-Text

## Llava 1.5 7b Hf Q4 K M GGUF
Marwan02 · 30 downloads · 1 like
This model is a GGUF-format conversion of llava-hf/llava-1.5-7b-hf, supporting image-to-text generation tasks.
Tags: Image-to-Text, English

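Quantized GGUF conversions like this are typically run with llama.cpp or its Python bindings. Below is a minimal sketch using llama-cpp-python; the GGUF and CLIP-projector file names are placeholders for whatever files the repository actually ships.

```python
# Minimal sketch: running a LLaVA-1.5 GGUF quantization with llama-cpp-python.
# File names below are placeholders; use the files shipped with the repo you download.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")  # CLIP projector
llm = Llama(
    model_path="llava-1.5-7b-hf-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # prompts with image tokens need a larger context window
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```
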
## Gemma 3 27b It Qat 3bit
mlx-community · Other · 197 downloads · 2 likes
This model is a 3-bit quantized conversion of google/gemma-3-27b-it-qat-q4_0-unquantized to the MLX format, suitable for image-to-text tasks.
Tags: Image-to-Text, Transformers, Other

## Gemma 3 27b It Qat 4bit
mlx-community · Other · 2,200 downloads · 12 likes
Gemma 3 27B IT QAT 4bit is an MLX-format conversion of Google's original model, supporting image-to-text tasks.
Tags: Image-to-Text, Transformers, Other

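MLX conversions like these are usually run on Apple silicon with the mlx-vlm package. The sketch below assumes the repo id shown and the load/generate helpers of recent mlx-vlm releases; the exact signatures have changed between versions, so treat it as a starting point rather than the definitive API.

```python
# Minimal sketch: generating from an MLX-converted Gemma 3 with mlx-vlm on Apple silicon.
# The repo id and the generate() signature are assumptions; check the mlx-vlm README
# for the API of your installed version.
from mlx_vlm import load, generate

model, processor = load("mlx-community/gemma-3-27b-it-qat-4bit")  # assumed repo id

output = generate(
    model,
    processor,
    "Describe this image.",
    image="path/to/image.jpg",  # placeholder image path
    max_tokens=128,
)
print(output)
```
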
## Gemma 3 27b It Qat Q4 0 Gguf
vinimuchulski · 4,674 downloads · 6 likes
Gemma 3 is a lightweight open multimodal model series from Google, supporting text and image inputs with text generation capabilities. This version is the 27B-parameter instruction-tuned model quantized with quantization-aware training (QAT), offering lower memory requirements while maintaining near-original quality.
Tags: Image-to-Text

## Gemma 3 27b It Mlx
stephenwalker · 24 downloads · 1 like
This is an MLX-converted version of the Google Gemma 3 27B IT model, supporting image-text-to-text tasks.
Tags: Image-to-Text, Transformers

## Rexseek 3B
IDEA-Research · Other · 186 downloads · 4 likes
This is an image-to-text model that processes both image and text inputs and generates corresponding text outputs.
Tags: Image-to-Text, Transformers

## Toriigate V0.4 7B I1 GGUF
mradermacher · Apache-2.0 · 410 downloads · 1 like
This is a weighted/importance-matrix (imatrix) quantized version of the Minthy/ToriiGate-v0.4-7B model, offering multiple quantization options to suit different needs.
Tags: Image-to-Text, English

## Gemma 3 12b It
google · 364.65k downloads · 340 likes
Gemma is a family of lightweight, state-of-the-art open multimodal models from Google, built from the research and technology used to create the Gemini models; it accepts text and image inputs and generates text outputs.
Tags: Image-to-Text, Transformers

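As a rough illustration of how such an instruction-tuned Gemma 3 checkpoint can be queried, the sketch below uses the transformers image-text-to-text pipeline. It assumes a transformers release with Gemma 3 support, a placeholder image URL, and that the model's license has been accepted on the Hub.

```python
# Minimal sketch: querying google/gemma-3-12b-it via the transformers
# "image-text-to-text" pipeline. The image URL is a placeholder.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
# The assistant's reply is appended as the last chat turn.
print(out[0]["generated_text"][-1]["content"])
```
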
## Kowen Vol 1 Base 7B
Gwonee · Apache-2.0 · 22 downloads · 1 like
A Korean vision-language model based on Qwen2-VL-7B-Instruct, supporting image-to-text tasks.
Tags: Image-to-Text, Transformers, Korean

## Aria Sequential Mlp Bnb Nf4
leon-se · Apache-2.0 · 76 downloads · 11 likes
A BitsAndBytes NF4 quantized version of Aria-sequential_mlp, suitable for image-to-text tasks; requires approximately 15.5 GB of VRAM.
Tags: Image-to-Text, Transformers

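For context, the sketch below shows the kind of bitsandbytes NF4 configuration used when loading 4-bit checkpoints like this through transformers; the model id is a placeholder, not the exact repository name of this listing.

```python
# Minimal sketch of NF4 (4-bit NormalFloat) quantization via bitsandbytes in transformers.
# The model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-vision-language-model",  # placeholder repo id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Aria-style models ship custom modeling code
)
```
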
## Doubutsu 2b Pt 756
qresearch · Apache-2.0 · 129 downloads · 3 likes
Doubutsu is a lightweight vision-language model series designed to be fine-tuned for customized scenarios.
Tags: Image-to-Text, Transformers, English

## Rgb Language Cap
sadassa17 · MIT · 15 downloads · 0 likes
This is a spatially aware vision-language model that recognizes spatial relationships between objects in an image and generates descriptive text.
Tags: Image-to-Text, Transformers, English

## Pix2struct Infographics Vqa Base
google · Apache-2.0 · 74 downloads · 8 likes
Pix2Struct is a vision-language understanding model pretrained on image-to-text tasks, here fine-tuned for visual question answering on high-resolution infographics.
Tags: Image-to-Text, Transformers, Multilingual

## Pix2struct Ocrvqa Base
google · Apache-2.0 · 38 downloads · 1 like
Pix2Struct is a visual question answering model fine-tuned on the OCR-VQA task, capable of parsing textual content in images and answering questions about it.
Tags: Image-to-Text, Transformers, Multilingual

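A minimal sketch of querying a Pix2Struct VQA checkpoint through transformers follows; the image file and question are placeholder inputs.

```python
# Minimal sketch: visual question answering with a Pix2Struct checkpoint via transformers.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-ocrvqa-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-ocrvqa-base")

image = Image.open("book_cover.jpg")          # placeholder image
question = "Who is the author of this book?"  # placeholder question

# For VQA-style Pix2Struct checkpoints, the processor renders the question as a text header.
inputs = processor(images=image, text=question, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```
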
## Pix2struct Chartqa Base
google · Apache-2.0 · 181 downloads · 8 likes
Pix2Struct is an image encoder-text decoder model trained on image-text pairs for multiple tasks; this checkpoint is fine-tuned for chart question answering.
Tags: Image-to-Text, Transformers, Multilingual

## Git Large Textvqa
microsoft · MIT · 62 downloads · 4 likes
GIT is a Transformer-decoder vision-language model conditioned on both CLIP image tokens and text tokens, fine-tuned for the TextVQA task.
Tags: Image-to-Text, Transformers, Multilingual

## Git Large Vqav2
microsoft · MIT · 401 downloads · 17 likes
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, trained on large-scale image-text pairs and suited to tasks such as visual question answering.
Tags: Image-to-Text, Transformers, Multilingual

## Git Base Textvqa
microsoft · MIT · 1,182 downloads · 6 likes
GIT is a Transformer-based vision-language model that converts images into textual descriptions; this checkpoint is fine-tuned for the TextVQA task.
Tags: Image-to-Text, Transformers, Multilingual

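For reference, the sketch below follows the usual transformers pattern for GIT visual question answering with the TextVQA checkpoint; the image file and question are placeholders.

```python
# Minimal sketch: TextVQA-style question answering with microsoft/git-base-textvqa.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")

image = Image.open("street_sign.jpg")  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# GIT expects the question prefixed with the CLS token; generation continues with the answer.
question = "what does the sign say?"  # placeholder question
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = [processor.tokenizer.cls_token_id] + input_ids
input_ids = torch.tensor(input_ids).unsqueeze(0)

generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```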