# Multimodal Image Understanding

### Pixtral 12b GGUF
**Apache-2.0** · lmstudio-community · Image-to-Text · 611 downloads · 1 like

A multimodal large model launched by Mistral-Community, supporting image and text processing with a 128k context length and variable image sizes.
### Gemma 3 12b It Qat 8bit
**Other** · mlx-community · Image-to-Text, Transformers, Other · 149 downloads · 1 like

An 8-bit quantized version of Google's Gemma 3 12B model, suitable for image-text-to-text tasks.
### Qwen2.5 VL 32B Instruct GGUF
**Apache-2.0** · samgreen · Image-to-Text, English · 25.59k downloads · 6 likes

Qwen2.5-VL-32B-Instruct is a multimodal vision-language model supporting joint understanding and generation tasks for both images and text.
### Qwen2.5 VL 7B Instruct GGUF
**Apache-2.0** · samgreen · Image-to-Text, English · 5,052 downloads · 9 likes

Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image-text generation tasks.
### Qwen2.5 VL 72B Instruct GGUF
**Other** · samgreen · Image-to-Text, English · 2,073 downloads · 1 like

Qwen2.5-VL-72B-Instruct is a multimodal vision-language model that supports interactive generation tasks involving images and text.
### Gemma 3 12b It Gguf
Mungert · Image-to-Text · 4,574 downloads · 11 likes

Gemma 3 is a lightweight multimodal open model launched by Google that accepts text and image inputs and generates text outputs. Built on the research and technology behind the Gemini models, it features a 128K context window and supports over 140 languages.
### Gemma 3 4b It Gguf
Mungert · Image-to-Text · 4,593 downloads · 9 likes

Gemma 3 is a lightweight open-source multimodal model introduced by Google, supporting image and text inputs to generate text outputs.
### Asagi 14B
**Apache-2.0** · MIL-UT · Image-to-Text, Transformers, Japanese · 83 downloads · 9 likes

Asagi-14B is a large-scale Japanese vision-and-language model (VLM) trained on a wide range of Japanese datasets drawn from diverse sources.
### Qwen2 VL 2B Instruct GGUF
**Apache-2.0** · second-state · Image-to-Text, English · 125 downloads · 3 likes

Qwen2-VL-2B-Instruct is a 2B-parameter multimodal vision-language model based on the Qwen2 architecture that supports image-text generation tasks.
### Llama3 Chat Vector Kor Llava V02
nebchi · Image-to-Text, Transformers, Supports Multiple Languages · 27 downloads · 2 likes

A Korean multimodal model based on the Llama3 architecture, supporting image understanding and Korean dialogue.
### Turkish LLaVA V0.1 Q4 K M GGUF
**MIT** · atasoglu · Image-to-Text, Other · 127 downloads · 4 likes

Turkish-LLaVA-v0.1-Q4_K_M-GGUF is a Turkish vision-language model that supports image-text-to-text processing tasks.
### Cerule V0.1
Tensoic · Image-to-Text, Transformers, English · 157 downloads · 47 likes

Cerule is a lightweight yet powerful vision-language model built on Google's Gemma-2b and SigLIP, focused on image-text processing.
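A listing like the one above is easy to work with programmatically once each card's fields are captured in a small record. The sketch below is illustrative only: the `ModelEntry` class and field names are assumptions (not an API of any model hub), the download/like interpretation follows the counts shown above, and only a subset of entries is included ("25.59k" is expanded to 25,590).

```python
from dataclasses import dataclass

@dataclass
class ModelEntry:
    """One catalog card; field names are illustrative, not a hub API."""
    name: str
    author: str
    license: str  # empty where the listing gives no license
    downloads: int
    likes: int

# A subset of the entries listed above, transcribed from the catalog.
CATALOG = [
    ModelEntry("Pixtral 12b GGUF", "lmstudio-community", "Apache-2.0", 611, 1),
    ModelEntry("Qwen2.5 VL 32B Instruct GGUF", "samgreen", "Apache-2.0", 25_590, 6),
    ModelEntry("Qwen2.5 VL 7B Instruct GGUF", "samgreen", "Apache-2.0", 5_052, 9),
    ModelEntry("Gemma 3 4b It Gguf", "Mungert", "", 4_593, 9),
    ModelEntry("Cerule V0.1", "Tensoic", "", 157, 47),
]

def top_by_downloads(entries, n=3):
    """Return the n entries with the highest download counts."""
    return sorted(entries, key=lambda e: e.downloads, reverse=True)[:n]

if __name__ == "__main__":
    for e in top_by_downloads(CATALOG):
        print(f"{e.name}: {e.downloads} downloads")
```

Sorting by downloads surfaces the Qwen2.5-VL GGUF builds first, matching the counts shown in the listing.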