# Multimodal understanding

Gemma 3 27b It Quantized.w4a16
This is a quantized version of google/gemma-3-27b-it that accepts image and text input and produces text output. As the w4a16 suffix indicates, weights are quantized to 4 bits while activations stay at 16 bits, which enables efficient inference with vLLM.
Image-to-Text Transformers
RedHatAI
302
1
Fastvlm 0.5B Stage3
Other
FastVLM-0.5B-Stage3 is an efficient multimodal language model with visual understanding and language processing capabilities. It can process long videos and generate structured outputs.
Image-to-Text Transformers English
zhaode
174
1
Fastvlm 0.5B Stage2
Other
FastVLM-0.5B-Stage2 is an efficient multimodal language model capable of understanding visual content and handling text tasks.
Multimodal Fusion Transformers English
zhaode
103
1
Gemma 3 1b It Qat Bnb 4bit
Gemma 3 is a lightweight open model series launched by Google, built on Gemini technology, supporting multimodal input and text output.
Image-to-Text Transformers
unsloth
23
1
Webssl Dino7b Full8b 518
A 7-billion-parameter Vision Transformer trained on 8 billion MetaCLIP samples with the DINOv2 self-supervised learning framework, requiring no language supervision.
Image Classification Transformers
facebook
157
7
Gemma 3 27b It Qat Unsloth Bnb 4bit
Gemma 3 is a lightweight, state-of-the-art multimodal open-source model launched by Google, capable of processing text and image inputs and generating text outputs.
Image-to-Text Transformers
unsloth
2,591
1
Gemma 3 1b It Qat
Gemma 3 is a lightweight multimodal model family launched by Google, capable of processing text and image inputs and generating text outputs. The family offers context windows of up to 128K tokens (32K for the 1B size) and multilingual support for over 140 languages.
Image-to-Text Transformers
unsloth
2,558
1
Gemma 3 4b It Qat Unsloth Bnb 4bit
Gemma 3 is a lightweight, cutting-edge open model series launched by Google, built on Gemini model technology, supporting multimodal input and text output.
Image-to-Text Transformers
unsloth
918
1
Gemma 3 27b It Qat
Gemma is a lightweight open model series launched by Google, built on Gemini model technology. Gemma 3 is a multimodal model supporting text and image inputs with text outputs, featuring a large 128K-token context window and multilingual capabilities.
Image-to-Text Transformers
unsloth
168
2
Gemma 3 12b It Qat Unsloth Bnb 4bit
Gemma 3 is a lightweight and state-of-the-art open model family launched by Google, built on the same research and technology as the Gemini model. It supports multimodal input and text output.
Image-to-Text Transformers
unsloth
1,422
1
Gemma 3 12b It Qat
Gemma 3 is a lightweight, state-of-the-art multimodal open-source model launched by Google. It can process text and image inputs and generate text outputs, suitable for various text generation and image understanding tasks.
Image-to-Text Transformers
unsloth
952
2
Kimi VL A3B Thinking 8bit
Other
Kimi-VL-A3B-Thinking-8bit is a multimodal vision-language model converted to the MLX format, supporting image-text-to-text generation tasks.
Image-to-Text Transformers Other
mlx-community
1,738
1
Kimi VL A3B Thinking 6bit
Other
Kimi-VL-A3B-Thinking-6bit is a multilingual vision-language model converted to the MLX format, supporting image-text-to-text tasks.
Image-to-Text Transformers Other
mlx-community
135
0
Gemma 3 27b It Qat Bf16
Gemma 3 27B IT QAT BF16 is a version of the Gemma series of models released by Google. It has undergone quantization-aware training (QAT) and is converted to the BF16 format, suitable for the MLX framework.
Image-to-Text Transformers
mlx-community
178
2
Gemma 3 27b It Qat 6bit
Other
This is a 6-bit quantized version of the Google Gemma 3 27B model, suitable for image-text-to-text tasks.
Image-to-Text Transformers Other
mlx-community
110
0
Mistral Small 3.1 24B Instruct 2503 Quantized.w8a8
Apache-2.0
This is an INT8-quantized Mistral-Small-3.1-24B-Instruct-2503 model, optimized by Red Hat and Neural Magic, suitable for fast response and low-latency scenarios.
Safetensors Supports Multiple Languages
RedHatAI
833
2
Gemma 3 4b It Qat 4bit
Other
Gemma 3 4B IT QAT 4bit is a 4-bit quantized large language model trained with Quantization-Aware Training (QAT), based on the Gemma 3 architecture and optimized for the MLX framework.
Image-to-Text Transformers Other
mlx-community
607
1
Gemma 3 27b It Qat Q4 0 Unquantized
Gemma 3 is a lightweight and advanced multimodal open model launched by Google. It is built on the same research and technology as the Gemini model, supporting text and image inputs and generating text outputs.
Image-to-Text Transformers
google
11.53k
23
Debiased Llama 4 Scout 17B 16E Instruct
Llama 4 Scout is a natively multimodal AI model launched by Meta, supporting multilingual text and image understanding. It adopts a Mixture of Experts architecture, and Meta describes its text and image understanding performance as industry-leading.
Image-to-Text Transformers Supports Multiple Languages
hirundo-io
1,716
0
Videochat R1 7B
Apache-2.0
VideoChat-R1_7B is a multimodal video understanding model based on Qwen2.5-VL-7B-Instruct, capable of processing video and text inputs and generating text outputs.
Video-to-Text Transformers English
OpenGVLab
1,686
7
Gemma 3 12b It Qat Int4 Unquantized
Gemma 3 is a lightweight multimodal open model from Google, supporting text and image inputs with text output, featuring a large 128K-token context window and multilingual capabilities.
Image-to-Text Transformers
google
1,358
9
Gemma 3 4b It Qat Int4 Unquantized
Gemma 3 is a lightweight multimodal open model launched by Google, supporting text and image input and generating text output. The 4B version has undergone instruction tuning and quantization-aware training, making it suitable for deployment in resource-constrained environments.
Image-to-Text Transformers
google
541
3
Gemma 3 27b It Qat Compressed Tensors
Gemma 3 is a lightweight and advanced open model series launched by Google, built on the same research and technology as the Gemini model. This version is an instruction-tuned model with 27B parameters, produced with quantization-aware training (QAT) and stored in the compressed-tensors format.
Image-to-Text
gaunernst
1,985
6
Gemma 3 12b It Qat Compressed Tensors
Gemma 3 is Google's lightweight cutting-edge open model family, built on the same research and technology used to create Gemini models. This model is multimodal, capable of processing both text and image inputs to generate text outputs.
Image-to-Text
gaunernst
867
1
Google Gemma 3 27b It
Gemma 3 is a lightweight and state-of-the-art open model family launched by Google, built on the same research and technology as the Gemini model. It is a multimodal model that can process text and image inputs and generate text outputs.
Image-to-Text Transformers
context-labs
2,313
0
Gemma 3 12b It Qat Q4 0 GGUF
Gemma is a lightweight, cutting-edge open model series from Google, built on Gemini technology. The 12B version is a multimodal model supporting text and image input, featuring a large 128K-token context window and support for over 140 languages.
Image-to-Text
Mungert
1,008
3
Gemma 3 4b It Qat Q4 0 Gguf
Gemma 3 is a lightweight open-source multimodal model family launched by Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.
Image-to-Text
vinimuchulski
197
0
Gemma 3 1b It Llamafile
Gemma is a lightweight open model series launched by Google, built on the same research and technology as Gemini. The llamafile version is packaged by Mozilla as a single executable file for easy use on multiple platforms.
Large Language Model
Mozilla
469
3
Gemma 3 4b Pt
Gemma 3 is a lightweight, state-of-the-art open model family launched by Google, built on the same research and technology as the Gemini model. It supports multimodality, can process text and image inputs and generate text outputs, and is suitable for a variety of text generation and image understanding tasks.
Image-to-Text Transformers
axolotl-mirrors
4,332
0
Mistral Small 3.1 24B Instruct 2503 FP8 Dynamic
Apache-2.0
This is a 24B-parameter conditional generation model based on the Mistral3 architecture, optimized with FP8 dynamic quantization, suitable for multilingual text generation and visual understanding tasks.
Safetensors Supports Multiple Languages
RedHatAI
2,650
5
Mistral Small 3.1 24B Instruct 2503
Apache-2.0
Mistral Small 3.1 is a multimodal large language model with 24 billion parameters. It offers visual understanding and a 128K-token context window, making it suitable for a wide range of tasks.
Image-to-Text Supports Multiple Languages
chutesai
2,035
0
Gemma 3 27b It Int4 Awq
Gemma is a lightweight and advanced open model series launched by Google, built on the same research and technology as Gemini. The 27B version is a multimodal model that supports text and image input and generates text output.
Image-to-Text Transformers
gaunernst
17.62k
16
Gemma 3 27b Pt Qat Q4 0 Gguf
Gemma is a lightweight and cutting-edge open model family launched by Google, built on the same research and technology as the Gemini model. Gemma 3 is a multimodal model that can process text and image inputs and generate text outputs.
Image-to-Text
google
633
24
Gemma 3 27b It Qat Q4 0 Gguf
Gemma is a lightweight open-source multimodal model series launched by Google. It supports text and image inputs and generates text outputs. It has a large 128K-token context window and supports over 140 languages.
Image-to-Text
google
69.29k
251
Gemma 3 4b It Int4 Awq
Gemma is a lightweight, advanced open model series from Google, built using the same research and technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.
Image-to-Text Transformers
gaunernst
1,054
1
Qwen2 VL 72B Instruct
Other
Qwen2-VL-72B-Instruct is a multimodal vision-language model that supports interaction between images and text, suitable for complex vision-language tasks.
Image-to-Text Transformers English
FriendliAI
18
1
Gemma 3 27b It GPTQ 4b 128g
This model is an INT4 quantized version of gemma-3-27b-it, reducing disk and GPU memory requirements by decreasing the number of bits per parameter.
Image-to-Text Transformers
ISTA-DASLab
32.15k
25
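The memory saving described for this INT4 model can be illustrated with rough arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 27 billion parameters (read off the model name), counts weight storage alone, and ignores per-group scales and zero-points, the KV cache, activations, and any layers left unquantized. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Nominal parameter count taken from the "27b" in the model name (assumption).
PARAMS_27B = 27e9

bf16_gib = weight_memory_gib(PARAMS_27B, 16)  # 16 bits per parameter
int4_gib = weight_memory_gib(PARAMS_27B, 4)   # 4 bits per parameter

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
print(f"reduction factor: {bf16_gib / int4_gib:.1f}x")
```

Real checkpoints will be somewhat larger than the 4-bit figure, because grouped quantization (the "128g" in the model name suggests groups of 128 weights) stores extra scale data per group.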
Gemma 3 4b It Qat Q4 0 Gguf
Gemma 3 is Google's lightweight, cutting-edge open-source multimodal model, supporting text and image inputs with text output and featuring a 128K context window and support for 140+ languages.
Image-to-Text
google
19.81k
120
Google.gemma 3 27b It GGUF
A quantized version of Google's Gemma-3-27b-it model for image-text-to-text tasks, published as part of an effort to make knowledge widely accessible.
Large Language Model
DevQuasar
123
0
Gemma 3 27b It GGUF
Gemma 3 is a lightweight multimodal model launched by Google. It is built on the same technology as Gemini, supports text and image inputs, and outputs text. It is suitable for various tasks.
Image-to-Text
ggml-org
2,882
21