# Multimodal understanding

Gemma 3 27b It Quantized.w4a16

This is a quantized version of google/gemma-3-27b-it that accepts image and text input and produces text output. The w4a16 scheme quantizes weights to 4 bits while keeping activations at 16 bits, enabling efficient inference with vLLM.

Fastvlm 0.5B Stage3

FastVLM-0.5B-Stage3 is an efficient multimodal language model with visual understanding and language processing capabilities. It can process long videos and generate structured outputs.

Transformers English

Fastvlm 0.5B Stage2

FastVLM-0.5B-Stage2 is an efficient multimodal language model capable of understanding visual content and handling text tasks.

Multimodal Fusion

Transformers English

Gemma 3 1b It Qat Bnb 4bit

Gemma 3 is a lightweight open model series launched by Google, built on Gemini technology, supporting multimodal input and text output.

Webssl Dino7b Full8b 518

A 7-billion-parameter Vision Transformer trained on 8 billion samples of MetaCLIP data with the DINOv2 self-supervised learning framework, requiring no language supervision.

Image Classification

Gemma 3 27b It Qat Unsloth Bnb 4bit

Gemma 3 is a lightweight, state-of-the-art multimodal open-source model launched by Google, capable of processing text and image inputs and generating text outputs.

Gemma 3 1b It Qat

Gemma 3 is a lightweight multimodal model family launched by Google, capable of processing text and image inputs and generating text outputs. The family offers a context window of up to 128K tokens and multilingual support for over 140 languages.

Gemma 3 4b It Qat Unsloth Bnb 4bit

Gemma 3 is a lightweight, cutting-edge open model series launched by Google, built on Gemini model technology, supporting multimodal input and text output.

Gemma 3 27b It Qat

Gemma is a lightweight open model series launched by Google, built on Gemini model technology. Gemma 3 is a multimodal model supporting text and image inputs with text outputs, featuring a 128K-token context window and multilingual capabilities.

Gemma 3 12b It Qat Unsloth Bnb 4bit

Gemma 3 is a lightweight and state-of-the-art open model family launched by Google, built on the same research and technology as the Gemini model. It supports multimodal input and text output.

Gemma 3 12b It Qat

Gemma 3 is a lightweight, state-of-the-art multimodal open-source model launched by Google. It can process text and image inputs and generate text outputs, suitable for various text generation and image understanding tasks.

Kimi VL A3B Thinking 8bit

Kimi-VL-A3B-Thinking-8bit is a multimodal vision-language model converted to the MLX format, supporting image-text-to-text generation tasks.

Transformers Other

Kimi VL A3B Thinking 6bit

Kimi-VL-A3B-Thinking-6bit is a multilingual vision-language model converted to the MLX format, supporting image-text-to-text tasks.

Transformers Other

Gemma 3 27b It Qat Bf16

Gemma 3 27B IT QAT BF16 is a version of the Gemma series of models released by Google. It has undergone quantization-aware training (QAT) and is converted to the BF16 format, suitable for the MLX framework.

Gemma 3 27b It Qat 6bit

This is a 6-bit quantized version of the Google Gemma 3 27B model, suitable for image-text-to-text tasks.

Transformers Other

Mistral Small 3.1 24B Instruct 2503 Quantized.w8a8

This is an INT8-quantized Mistral-Small-3.1-24B-Instruct-2503 model, optimized by Red Hat and Neural Magic, suitable for fast response and low-latency scenarios.

Safetensors Supports Multiple Languages

Gemma 3 4b It Qat 4bit

Gemma 3 4B IT QAT 4bit is a 4-bit quantized large language model trained with Quantization-Aware Training (QAT), based on the Gemma 3 architecture and optimized for the MLX framework.

Transformers Other

Gemma 3 27b It Qat Q4 0 Unquantized

Gemma 3 is a lightweight and advanced multimodal open model launched by Google. It is built on the same research and technology as the Gemini model, supporting text and image inputs and generating text outputs.

Debiased Llama 4 Scout 17B 16E Instruct

Llama 4 Scout is a native multimodal AI model launched by Meta, supporting multilingual text and image understanding. It adopts the Mixture of Experts architecture and has industry-leading performance in text and image understanding.

Transformers Supports Multiple Languages

Videochat R1 7B

VideoChat-R1_7B is a multimodal video understanding model based on Qwen2.5-VL-7B-Instruct, capable of processing video and text inputs and generating text outputs.

Transformers English

Gemma 3 12b It Qat Int4 Unquantized

Gemma 3 is a lightweight multimodal open model from Google, supporting text and image inputs with text output, featuring a 128K-token context window and multilingual capabilities.

Gemma 3 4b It Qat Int4 Unquantized

Gemma 3 is a lightweight multimodal open model launched by Google, supporting text and image input and generating text output. The 4B version has undergone instruction tuning and quantization-aware training, making it suitable for deployment in resource-constrained environments.

Gemma 3 27b It Qat Compressed Tensors

Gemma 3 is a lightweight and advanced open model series launched by Google, built on the same research and technology as the Gemini model. This version is an instruction-tuned model with 27B parameters, using quantization-aware training (QAT) and compressed tensor technology.

Gemma 3 12b It Qat Compressed Tensors

Gemma 3 is Google's lightweight cutting-edge open model family, built on the same research and technology used to create Gemini models. This model is multimodal, capable of processing both text and image inputs to generate text outputs.

Google Gemma 3 27b It

Gemma 3 is a lightweight and state-of-the-art open model family launched by Google, built on the same research and technology as the Gemini model. It is a multimodal model that can process text and image inputs and generate text outputs.

Gemma 3 12b It Qat Q4 0 GGUF

Gemma is a lightweight, cutting-edge open model series from Google, built on Gemini technology. The 12B version is a multimodal model supporting text and image input, featuring a 128K-token context window and support for over 140 languages.

Gemma 3 4b It Qat Q4 0 Gguf

Gemma 3 is a lightweight open-source multimodal model family launched by Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.

Gemma 3 1b It Llamafile

Gemma is a lightweight open model series launched by Google, built on the same research and technology as Gemini. The llamafile version is packaged by Mozilla as a single executable file for easy use on multiple platforms.

Gemma 3 is a lightweight, state-of-the-art open model family launched by Google, built on the same research and technology as the Gemini model. It supports multimodality, can process text and image inputs and generate text outputs, and is suitable for a variety of text generation and image understanding tasks.

axolotl-mirrors

Mistral Small 3.1 24B Instruct 2503 FP8 Dynamic

This is a 24B-parameter conditional generation model based on the Mistral3 architecture, optimized with FP8 dynamic quantization, suitable for multilingual text generation and visual understanding tasks.

Safetensors Supports Multiple Languages

Mistral Small 3.1 24B Instruct 2503

Mistral Small 3.1 is a multimodal large language model with 24 billion parameters. It offers visual understanding and a 128K-token context window, making it suitable for a wide range of tasks.

Image-to-Text Supports Multiple Languages

Gemma 3 27b It Int4 Awq

Gemma is a lightweight and advanced open model series launched by Google, built on the same research and technology as Gemini. The 27B version is a multimodal model that supports text and image input and generates text output.

Gemma 3 27b Pt Qat Q4 0 Gguf

Gemma is a lightweight and cutting-edge open model family launched by Google, built on the same research and technology as the Gemini model. Gemma 3 is a multimodal model that can process text and image inputs and generate text outputs.

Gemma 3 27b It Qat Q4 0 Gguf

Gemma is a lightweight open-source multimodal model series launched by Google. It supports text and image inputs and generates text outputs, with a 128K-token context window and support for over 140 languages.

Gemma 3 4b It Int4 Awq

Gemma is a lightweight, advanced open model series from Google, built on the same research and technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.

Qwen2 VL 72B Instruct

Qwen2-VL-72B-Instruct is a multimodal vision-language model that supports interaction between images and text, suitable for complex vision-language tasks.

Transformers English

Gemma 3 27b It GPTQ 4b 128g

This model is an INT4 quantized version of gemma-3-27b-it, reducing disk and GPU memory requirements by decreasing the number of bits per parameter.
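As a rough illustration of why fewer bits per parameter reduce disk and GPU memory needs, the sketch below estimates weight storage only. The helper name `weight_memory_gib` is hypothetical, the 27-billion parameter count is a nominal figure assumed from the model name, and the estimate ignores quantization scales, activations, and the KV cache, so real footprints will differ.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width.

    Counts only the raw weight bits; quantization scales, zero points,
    activations, and KV cache are deliberately left out.
    """
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Nominal parameter count assumed from the "27b" in the model name.
PARAMS_27B = 27e9

if __name__ == "__main__":
    for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: {weight_memory_gib(PARAMS_27B, bits):.1f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage to one quarter; group-wise schemes such as the 128-element groups suggested by the "128g" suffix add a small overhead for per-group scales that this sketch does not model.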

Gemma 3 4b It Qat Q4 0 Gguf

Gemma 3 is Google's lightweight, cutting-edge open-source multimodal model supporting text and image inputs with text output, featuring a 128K-token context window and support for more than 140 languages.

Google.gemma 3 27b It GGUF

A quantized version of Google's Gemma-3-27b-it model for image-text-to-text tasks, released with the stated aim of making knowledge widely accessible.

Large Language Model

Gemma 3 27b It GGUF

Gemma 3 is a lightweight multimodal model launched by Google. It is built on the same technology as Gemini, supports text and image inputs, and outputs text. It is suitable for various tasks.
