# Image-to-Text Generation

## Dimple 7B
rp-yu · Apache-2.0 · 422 downloads · 3 likes
Dimple is the first discrete diffusion multimodal large language model (DMLLM), combining autoregressive and diffusion training paradigms. Trained on the same dataset as LLaVA-NeXT, it outperforms LLaVA-NeXT-7B by 3.9%.
Tags: Image-to-Text, Transformers, English

## Magma 8B GGUF
Mungert · MIT · 545 downloads · 1 like
Magma-8B is an image-text-to-text model distributed in GGUF format, suitable for multimodal task processing.
Tags: Image-to-Text

## Llava 1.5 7b Hf Q4 K M GGUF
Marwan02 · 30 downloads · 1 like
This model is a GGUF-format conversion of llava-hf/llava-1.5-7b-hf, supporting image-to-text generation tasks.
Tags: Image-to-Text, English

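Quantized GGUF conversions like this are typically run with llama.cpp or its Python bindings. Below is a minimal sketch using llama-cpp-python; the GGUF and CLIP-projector file names are placeholders for whatever files the repository actually ships.

```python
# Minimal sketch: running a LLaVA-1.5 GGUF quantization with llama-cpp-python.
# File names below are placeholders; use the files shipped with the repo you download.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")  # CLIP projector
llm = Llama(
    model_path="llava-1.5-7b-hf-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # prompts with image tokens need a larger context window
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```
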
## Gemma 3 27b It Qat 3bit
mlx-community · Other · 197 downloads · 2 likes
This model is a 3-bit quantized conversion of google/gemma-3-27b-it-qat-q4_0-unquantized to the MLX format, suitable for image-to-text tasks.
Tags: Image-to-Text, Transformers, Other

## Gemma 3 27b It Qat 4bit
mlx-community · Other · 2,200 downloads · 12 likes
Gemma 3 27B IT QAT 4bit is an MLX-format conversion of Google's original model, supporting image-to-text tasks.
Tags: Image-to-Text, Transformers, Other

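MLX conversions like these are usually run on Apple silicon with the mlx-vlm package. The sketch below assumes the repo id shown and the load/generate helpers of recent mlx-vlm releases; the exact signatures have changed between versions, so treat it as a starting point rather than the definitive API.

```python
# Minimal sketch: generating from an MLX-converted Gemma 3 with mlx-vlm on Apple silicon.
# The repo id and the generate() signature are assumptions; check the mlx-vlm README
# for the API of your installed version.
from mlx_vlm import load, generate

model, processor = load("mlx-community/gemma-3-27b-it-qat-4bit")  # assumed repo id

output = generate(
    model,
    processor,
    "Describe this image.",
    image="path/to/image.jpg",  # placeholder image path
    max_tokens=128,
)
print(output)
```
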
## Gemma 3 27b It Qat Q4 0 Gguf
vinimuchulski · 4,674 downloads · 6 likes
Gemma 3 is a lightweight open multimodal model series from Google, supporting text and image inputs with text generation capabilities. This version is the 27B-parameter instruction-tuned model quantized with quantization-aware training (QAT), offering lower memory requirements while maintaining near-original quality.
Tags: Image-to-Text

## Gemma 3 27b It Mlx
stephenwalker · 24 downloads · 1 like
This is an MLX-converted version of the Google Gemma 3 27B IT model, supporting image-text-to-text tasks.
Tags: Image-to-Text, Transformers

## Rexseek 3B
IDEA-Research · Other · 186 downloads · 4 likes
This is an image-to-text model that processes both image and text inputs and generates corresponding text outputs.
Tags: Image-to-Text, Transformers

## Toriigate V0.4 7B I1 GGUF
mradermacher · Apache-2.0 · 410 downloads · 1 like
This is a weighted/importance-matrix (imatrix) quantized version of the Minthy/ToriiGate-v0.4-7B model, offering multiple quantization options to suit different needs.
Tags: Image-to-Text, English

## Gemma 3 12b It
google · 364.65k downloads · 340 likes
Gemma is a family of lightweight, state-of-the-art open multimodal models from Google, built from the research and technology used to create the Gemini models; it accepts text and image inputs and generates text outputs.
Tags: Image-to-Text, Transformers

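As a rough illustration of how such an instruction-tuned Gemma 3 checkpoint can be queried, the sketch below uses the transformers image-text-to-text pipeline. It assumes a transformers release with Gemma 3 support, a placeholder image URL, and that the model's license has been accepted on the Hub.

```python
# Minimal sketch: querying google/gemma-3-12b-it via the transformers
# "image-text-to-text" pipeline. The image URL is a placeholder.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
# The assistant's reply is appended as the last chat turn.
print(out[0]["generated_text"][-1]["content"])
```
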
## Kowen Vol 1 Base 7B
Gwonee · Apache-2.0 · 22 downloads · 1 like
A Korean vision-language model based on Qwen2-VL-7B-Instruct, supporting image-to-text tasks.
Tags: Image-to-Text, Transformers, Korean

## Aria Sequential Mlp Bnb Nf4
leon-se · Apache-2.0 · 76 downloads · 11 likes
A BitsAndBytes NF4 quantized version of Aria-sequential_mlp, suitable for image-to-text tasks; requires approximately 15.5 GB of VRAM.
Tags: Image-to-Text, Transformers

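For context, the sketch below shows the kind of bitsandbytes NF4 configuration used when loading 4-bit checkpoints like this through transformers; the model id is a placeholder, not the exact repository name of this listing.

```python
# Minimal sketch of NF4 (4-bit NormalFloat) quantization via bitsandbytes in transformers.
# The model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-vision-language-model",  # placeholder repo id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Aria-style models ship custom modeling code
)
```
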
## Doubutsu 2b Pt 756
qresearch · Apache-2.0 · 129 downloads · 3 likes
Doubutsu is a lightweight vision-language model series designed to be fine-tuned for customized scenarios.
Tags: Image-to-Text, Transformers, English

## Rgb Language Cap
sadassa17 · MIT · 15 downloads · 0 likes
This is a spatially aware vision-language model that recognizes spatial relationships between objects in an image and generates descriptive text.
Tags: Image-to-Text, Transformers, English

## Pix2struct Infographics Vqa Base
google · Apache-2.0 · 74 downloads · 8 likes
Pix2Struct is a vision-language understanding model pretrained on image-to-text tasks, here fine-tuned for visual question answering on high-resolution infographics.
Tags: Image-to-Text, Transformers, Multilingual

## Pix2struct Ocrvqa Base
google · Apache-2.0 · 38 downloads · 1 like
Pix2Struct is a visual question answering model fine-tuned on the OCR-VQA task, capable of parsing textual content in images and answering questions about it.
Tags: Image-to-Text, Transformers, Multilingual

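A minimal sketch of querying a Pix2Struct VQA checkpoint through transformers follows; the image file and question are placeholder inputs.

```python
# Minimal sketch: visual question answering with a Pix2Struct checkpoint via transformers.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-ocrvqa-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-ocrvqa-base")

image = Image.open("book_cover.jpg")          # placeholder image
question = "Who is the author of this book?"  # placeholder question

# For VQA-style Pix2Struct checkpoints, the processor renders the question as a text header.
inputs = processor(images=image, text=question, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```
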
## Pix2struct Chartqa Base
google · Apache-2.0 · 181 downloads · 8 likes
Pix2Struct is an image encoder-text decoder model trained on image-text pairs for multiple tasks; this checkpoint is fine-tuned for chart question answering.
Tags: Image-to-Text, Transformers, Multilingual

## Git Large Textvqa
microsoft · MIT · 62 downloads · 4 likes
GIT is a Transformer-decoder vision-language model conditioned on both CLIP image tokens and text tokens, fine-tuned for the TextVQA task.
Tags: Image-to-Text, Transformers, Multilingual

## Git Large Vqav2
microsoft · MIT · 401 downloads · 17 likes
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, trained on large-scale image-text pairs and suited to tasks such as visual question answering.
Tags: Image-to-Text, Transformers, Multilingual

## Git Base Textvqa
microsoft · MIT · 1,182 downloads · 6 likes
GIT is a Transformer-based vision-language model that converts images into textual descriptions; this checkpoint is fine-tuned for the TextVQA task.
Tags: Image-to-Text, Transformers, Multilingual

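For reference, the sketch below follows the usual transformers pattern for GIT visual question answering with the TextVQA checkpoint; the image file and question are placeholders.

```python
# Minimal sketch: TextVQA-style question answering with microsoft/git-base-textvqa.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")

image = Image.open("street_sign.jpg")  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# GIT expects the question prefixed with the CLS token; generation continues with the answer.
question = "what does the sign say?"  # placeholder question
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = [processor.tokenizer.cls_token_id] + input_ids
input_ids = torch.tensor(input_ids).unsqueeze(0)

generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```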