# Multimodal instruction fine-tuning

| Model | Author | License | Description | Tags | Downloads | Likes |
|---|---|---|---|---|---|---|
| Qwen2.5 VL VQA Vibook | sunbv56 | Apache-2.0 | A visual question answering model based on the Qwen2.5 architecture, focused on Vietnamese scenarios and answering image-related questions. | Text-to-Image, Other | 148 | 0 |
| Qwen Qwen2.5 VL 32B Instruct GGUF | bartowski | Apache-2.0 | Qwen2.5-VL-32B-Instruct is a 32B-parameter multimodal vision-language model supporting image understanding and text generation. | Text-to-Image, English | 2,782 | 1 |
| Phi 4 Multimodal Instruct Ko ASR | junnei | n/a | A Korean automatic speech recognition (ASR) and speech translation (AST) model fine-tuned from microsoft/Phi-4-multimodal-instruct, with strong results on the zeroth-korean and fleurs datasets. | Text-to-Audio, Transformers, Korean | 354 | 3 |
| Pixtral 12B Captioner Relaxed | unalignment | Apache-2.0 | A multimodal large language model fine-tuned from Pixtral-12B-2409, specializing in generating rich image descriptions. | Image-to-Text, Transformers, English | 26 | 3 |
| Llama 3.2 11B Vision Instruct Abliterated 8 Bit | mlx-community | n/a | A multimodal model based on Llama-3.2-11B-Vision-Instruct that takes image and text input and generates text output. | Image-to-Text, Transformers, Multilingual | 128 | 0 |
| xGen-MM Phi3 Mini Base R v1.5 | Salesforce | Apache-2.0 | xGen-MM is a series of large multimodal models (LMMs) from Salesforce AI Research that builds on the BLIP series with enhanced features and stronger foundational capabilities. | Text-to-Image, Safetensors, English | 830 | 21 |
| ViP-LLaVA 7B | mucai | n/a | ViP-LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on image- and region-level instruction data. | Text-to-Image, Transformers | 66.75k | 8 |
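
Most of the instruction-tuned vision-language checkpoints above are queried with the same image-plus-text prompt pattern. Below is a minimal sketch of that pattern for the Qwen2.5-VL family (the full-precision upstream of the GGUF entry in the table), assuming a recent `transformers` release with Qwen2.5-VL support; the model ID, image path, and question are illustrative placeholders, not values taken from the listing.

```python
# Minimal sketch: asking a Qwen2.5-VL instruct checkpoint a question about an image.
# Assumes transformers >= 4.49 (Qwen2.5-VL support); smaller Qwen2.5-VL-*-Instruct
# checkpoints load the same way if the 32B variant is too large for local hardware.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"  # illustrative model ID
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-style prompt with one image placeholder and one text question.
image = Image.open("example.jpg")  # illustrative image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this picture?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The GGUF conversion in the table targets llama.cpp-style runtimes instead, so it is run through that toolchain rather than through `transformers`; the sketch above only covers the original safetensors checkpoints.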