# Multimodal instruction following
Mistral Small 3.2 24B Instruct 2506
Apache-2.0
Mistral-Small-3.2-24B-Instruct-2506 is an image-text-to-text model and an updated version of Mistral-Small-3.1-24B-Instruct-2503, with improvements in instruction following, repetition-error reduction, and function calling.
Image-Text-to-Text
Safetensors Supports Multiple Languages
unsloth
1,750
2
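As a concrete illustration of what "image-text-to-text" means in practice, this is the kind of mixed image-and-text chat payload such a model consumes. The field names follow the widely used OpenAI-style chat schema and are an assumption for illustration, not taken from the model card:

```python
# Hypothetical sketch: a single user turn combining an image and a text
# prompt for an image-text-to-text model. Field names follow the common
# OpenAI-style chat format (an assumption, not from the model card).
message = {
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Describe this chart and list its axes."},
    ],
}
```

The model answers in plain text, so only the input side mixes modalities.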
Qwen3 8B GGUF
Apache-2.0
An 8B-parameter large language model developed by the Qwen team, supporting ultra-long context and multilingual processing.
Large Language Model
lmstudio-community
39.45k
6
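GGUF here is the single-file model format used by llama.cpp. A minimal sketch of how a tool might sanity-check such a file, assuming only the documented on-disk layout (the 4-byte ASCII magic "GGUF" followed by a little-endian uint32 format version); the helper names are illustrative:

```python
import struct

def is_gguf(header: bytes) -> bool:
    """True if the buffer starts with the GGUF magic bytes."""
    return header[:4] == b"GGUF"

def gguf_version(header: bytes) -> int:
    """Format version: a little-endian uint32 right after the magic."""
    return struct.unpack("<I", header[4:8])[0]

# Synthetic header for demonstration -- not read from a real model file.
sample_header = b"GGUF" + struct.pack("<I", 3)
```

In practice you would read the first 8 bytes of the `.gguf` file and pass them to these helpers before attempting a full load.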
Llama 3.2 11B Vision Instruct Nf4
A 4-bit (NF4) quantized version of meta-llama/Llama-3.2-11B-Vision-Instruct, supporting image understanding and text generation tasks.
Image-to-Text
Transformers

SeanScripts
658
12
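The NF4 entry above stores each weight in 4 bits. A toy sketch of the general idea, rounding each weight to the nearest of 16 codebook levels; this uses a uniform codebook for simplicity, whereas real NF4 spaces its levels at normal-distribution quantiles:

```python
def quantize_4bit(weights, levels):
    """Map each weight to the index of the nearest of 16 levels."""
    assert len(levels) == 16  # 4 bits -> 16 representable values
    return [min(range(16), key=lambda i: abs(levels[i] - w)) for w in weights]

def dequantize(indices, levels):
    """Recover approximate weights from their 4-bit indices."""
    return [levels[i] for i in indices]

# Toy uniform codebook over [-1, 1]. Real NF4 instead places the 16
# levels at quantiles of a normal distribution, which suits the typical
# distribution of neural-network weights better.
LEVELS = [i / 7.5 - 1.0 for i in range(16)]
```

Storage drops from 16 or 32 bits per weight to 4 bits plus a per-block scale, which is what lets an 11B vision model fit on a single consumer GPU.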
MQT LLaVA 7b
MQT-LLaVA is an open-source multimodal chatbot built on the Transformer architecture, trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction data.
Image-Text-to-Text
Transformers

gordonhu
349
5
Llama Vid 7b Full 224 Video Fps 1
LLaMA-VID is an open-source multimodal chatbot fine-tuned from LLaMA/Vicuna, supporting hours-long video processing through extended context tokens.
Video-Text-to-Text
Transformers

YanweiLi
86
9
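LLaMA-VID's key trick is representing every sampled frame with just two tokens (a context token and a content token), which is why hour-long video fits in an ordinary LLM context window. The back-of-envelope helper below sketches that budget; the 2-tokens-per-frame figure is the published one, while the function itself is illustrative:

```python
def video_tokens(duration_s: int, fps: float = 1.0,
                 tokens_per_frame: int = 2) -> int:
    """Token budget for a clip sampled at `fps`, LLaMA-VID-style.

    tokens_per_frame=2 reflects LLaMA-VID's context + content token pair;
    naive approaches spend hundreds of tokens per frame instead.
    """
    return int(duration_s * fps) * tokens_per_frame

# One hour at 1 fps: 3600 frames * 2 tokens = 7200 tokens.
```

At 7,200 tokens per hour of video, even a modest context window leaves room for the dialogue itself.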
Bakllava 1
Apache-2.0
BakLLaVA-1 is a multimodal model based on Mistral 7B and enhanced with the LLaVA 1.5 architecture, outperforming Llama 2 13B on multiple benchmarks.
Image-Text-to-Text
Transformers English

SkunkworksAI
152
380