# Multimodal instruction following
Mistral Small 3.2 24B Instruct 2506
Apache-2.0
Mistral-Small-3.2-24B-Instruct-2506 is an image-text-to-text model and an updated version of Mistral-Small-3.1-24B-Instruct-2503, with improvements in instruction following, repetition-error reduction, and function calling.
Image-Text-to-Text
Safetensors Supports Multiple Languages
unsloth
1,750
2
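As a concrete illustration of what "image-text-to-text" means in practice, this is the kind of mixed image-and-text chat payload such a model consumes. The field names follow the widely used OpenAI-style chat schema and are an assumption for illustration, not taken from the model card:

```python
# Hypothetical sketch: a single user turn combining an image and a text
# prompt for an image-text-to-text model. Field names follow the common
# OpenAI-style chat format (an assumption, not from the model card).
message = {
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Describe this chart and list its axes."},
    ],
}
```

The model answers in plain text, so only the input side mixes modalities.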
Qwen3 8B GGUF
Apache-2.0
An 8B-parameter large language model developed by the Qwen team, supporting ultra-long context and multilingual processing.
Large Language Model
lmstudio-community
39.45k
6
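GGUF here is the single-file model format used by llama.cpp. A minimal sketch of how a tool might sanity-check such a file, assuming only the documented on-disk layout (the 4-byte ASCII magic "GGUF" followed by a little-endian uint32 format version); the helper names are illustrative:

```python
import struct

def is_gguf(header: bytes) -> bool:
    """True if the buffer starts with the GGUF magic bytes."""
    return header[:4] == b"GGUF"

def gguf_version(header: bytes) -> int:
    """Format version: a little-endian uint32 right after the magic."""
    return struct.unpack("<I", header[4:8])[0]

# Synthetic header for demonstration -- not read from a real model file.
sample_header = b"GGUF" + struct.pack("<I", 3)
```

In practice you would read the first 8 bytes of the `.gguf` file and pass them to these helpers before attempting a full load.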
Llama 3.2 11B Vision Instruct Nf4
A 4-bit (NF4) quantized version of meta-llama/Llama-3.2-11B-Vision-Instruct, supporting image understanding and text generation tasks.
Image-to-Text
Transformers

SeanScripts
658
12
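The NF4 entry above stores each weight in 4 bits. A toy sketch of the general idea, rounding each weight to the nearest of 16 codebook levels; this uses a uniform codebook for simplicity, whereas real NF4 spaces its levels at normal-distribution quantiles:

```python
def quantize_4bit(weights, levels):
    """Map each weight to the index of the nearest of 16 levels."""
    assert len(levels) == 16  # 4 bits -> 16 representable values
    return [min(range(16), key=lambda i: abs(levels[i] - w)) for w in weights]

def dequantize(indices, levels):
    """Recover approximate weights from their 4-bit indices."""
    return [levels[i] for i in indices]

# Toy uniform codebook over [-1, 1]. Real NF4 instead places the 16
# levels at quantiles of a normal distribution, which suits the typical
# distribution of neural-network weights better.
LEVELS = [i / 7.5 - 1.0 for i in range(16)]
```

Storage drops from 16 or 32 bits per weight to 4 bits plus a per-block scale, which is what lets an 11B vision model fit on a single consumer GPU.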
MQT LLaVA 7b
MQT-LLaVA is an open-source multimodal chatbot built on the Transformer architecture, trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction data.
Image-Text-to-Text
Transformers

gordonhu
349
5
Llama Vid 7b Full 224 Video Fps 1
LLaMA-VID is an open-source multimodal chatbot fine-tuned from LLaMA/Vicuna, supporting hours-long video processing through extended context tokens.
Video-Text-to-Text
Transformers

YanweiLi
86
9
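LLaMA-VID's key trick is representing every sampled frame with just two tokens (a context token and a content token), which is why hour-long video fits in an ordinary LLM context window. The back-of-envelope helper below sketches that budget; the 2-tokens-per-frame figure is the published one, while the function itself is illustrative:

```python
def video_tokens(duration_s: int, fps: float = 1.0,
                 tokens_per_frame: int = 2) -> int:
    """Token budget for a clip sampled at `fps`, LLaMA-VID-style.

    tokens_per_frame=2 reflects LLaMA-VID's context + content token pair;
    naive approaches spend hundreds of tokens per frame instead.
    """
    return int(duration_s * fps) * tokens_per_frame

# One hour at 1 fps: 3600 frames * 2 tokens = 7200 tokens.
```

At 7,200 tokens per hour of video, even a modest context window leaves room for the dialogue itself.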
Bakllava 1
Apache-2.0
BakLLaVA-1 is a multimodal model based on Mistral 7B and enhanced with the LLaVA 1.5 architecture, outperforming Llama 2 13B on multiple benchmarks.
Image-Text-to-Text
Transformers English

SkunkworksAI
152
380