# Multimodal Image Understanding

### Pixtral 12b GGUF
**Apache-2.0** · lmstudio-community · Image-to-Text · 611 downloads · 1 like

A multimodal large model launched by Mistral-Community, supporting image and text processing with a 128k context length and variable image sizes.
### Gemma 3 12b It Qat 8bit
**Other** · mlx-community · Image-to-Text, Transformers, Other · 149 downloads · 1 like

An 8-bit quantized version of Google's Gemma 3 12B model, suitable for image-text-to-text tasks.
### Qwen2.5 VL 32B Instruct GGUF
**Apache-2.0** · samgreen · Image-to-Text, English · 25.59k downloads · 6 likes

Qwen2.5-VL-32B-Instruct is a multimodal vision-language model supporting joint understanding and generation tasks for both images and text.
### Qwen2.5 VL 7B Instruct GGUF
**Apache-2.0** · samgreen · Image-to-Text, English · 5,052 downloads · 9 likes

Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image-text generation tasks.
### Qwen2.5 VL 72B Instruct GGUF
**Other** · samgreen · Image-to-Text, English · 2,073 downloads · 1 like

Qwen2.5-VL-72B-Instruct is a multimodal vision-language model that supports interactive generation tasks involving images and text.
### Gemma 3 12b It Gguf
Mungert · Image-to-Text · 4,574 downloads · 11 likes

Gemma 3 is a lightweight multimodal open model launched by Google that accepts text and image inputs and generates text outputs. Built on the research and technology behind the Gemini models, it features a 128K context window and supports over 140 languages.
### Gemma 3 4b It Gguf
Mungert · Image-to-Text · 4,593 downloads · 9 likes

Gemma 3 is a lightweight open-source multimodal model introduced by Google, supporting image and text inputs to generate text outputs.
### Asagi 14B
**Apache-2.0** · MIL-UT · Image-to-Text, Transformers, Japanese · 83 downloads · 9 likes

Asagi-14B is a large-scale Japanese vision-and-language model (VLM) trained on a wide range of Japanese datasets drawn from diverse sources.
### Qwen2 VL 2B Instruct GGUF
**Apache-2.0** · second-state · Image-to-Text, English · 125 downloads · 3 likes

Qwen2-VL-2B-Instruct is a 2B-parameter multimodal vision-language model based on the Qwen2 architecture that supports image-text generation tasks.
### Llama3 Chat Vector Kor Llava V02
nebchi · Image-to-Text, Transformers, Supports Multiple Languages · 27 downloads · 2 likes

A Korean multimodal model based on the Llama3 architecture, supporting image understanding and Korean dialogue.
### Turkish LLaVA V0.1 Q4 K M GGUF
**MIT** · atasoglu · Image-to-Text, Other · 127 downloads · 4 likes

Turkish-LLaVA-v0.1-Q4_K_M-GGUF is a Turkish vision-language model that supports image-text-to-text processing tasks.
### Cerule V0.1
Tensoic · Image-to-Text, Transformers, English · 157 downloads · 47 likes

Cerule is a lightweight yet powerful vision-language model built on Google's Gemma-2b and SigLIP, focused on image-text processing.
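A listing like the one above is easy to work with programmatically once each card's fields are captured in a small record. The sketch below is illustrative only: the `ModelEntry` class and field names are assumptions (not an API of any model hub), the download/like interpretation follows the counts shown above, and only a subset of entries is included ("25.59k" is expanded to 25,590).

```python
from dataclasses import dataclass

@dataclass
class ModelEntry:
    """One catalog card; field names are illustrative, not a hub API."""
    name: str
    author: str
    license: str  # empty where the listing gives no license
    downloads: int
    likes: int

# A subset of the entries listed above, transcribed from the catalog.
CATALOG = [
    ModelEntry("Pixtral 12b GGUF", "lmstudio-community", "Apache-2.0", 611, 1),
    ModelEntry("Qwen2.5 VL 32B Instruct GGUF", "samgreen", "Apache-2.0", 25_590, 6),
    ModelEntry("Qwen2.5 VL 7B Instruct GGUF", "samgreen", "Apache-2.0", 5_052, 9),
    ModelEntry("Gemma 3 4b It Gguf", "Mungert", "", 4_593, 9),
    ModelEntry("Cerule V0.1", "Tensoic", "", 157, 47),
]

def top_by_downloads(entries, n=3):
    """Return the n entries with the highest download counts."""
    return sorted(entries, key=lambda e: e.downloads, reverse=True)[:n]

if __name__ == "__main__":
    for e in top_by_downloads(CATALOG):
        print(f"{e.name}: {e.downloads} downloads")
```

Sorting by downloads surfaces the Qwen2.5-VL GGUF builds first, matching the counts shown in the listing.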