# Multimodal instruction fine-tuning

| Model | Author | License | Description | Tags | Downloads | Likes |
|---|---|---|---|---|---|---|
| Qwen2.5 VL VQA Vibook | sunbv56 | Apache-2.0 | A visual question answering model based on the Qwen2.5 architecture, focused on Vietnamese scenarios and answering image-related questions. | Text-to-Image, Other | 148 | 0 |
| Qwen Qwen2.5 VL 32B Instruct GGUF | bartowski | Apache-2.0 | Qwen2.5-VL-32B-Instruct is a 32B-parameter multimodal vision-language model supporting image understanding and text generation. | Text-to-Image, English | 2,782 | 1 |
| Phi 4 Multimodal Instruct Ko ASR | junnei | n/a | A Korean automatic speech recognition (ASR) and speech translation (AST) model fine-tuned from microsoft/Phi-4-multimodal-instruct, with strong results on the zeroth-korean and fleurs datasets. | Text-to-Audio, Transformers, Korean | 354 | 3 |
| Pixtral 12B Captioner Relaxed | unalignment | Apache-2.0 | A multimodal large language model fine-tuned from Pixtral-12B-2409, specializing in generating rich image descriptions. | Image-to-Text, Transformers, English | 26 | 3 |
| Llama 3.2 11B Vision Instruct Abliterated 8 Bit | mlx-community | n/a | A multimodal model based on Llama-3.2-11B-Vision-Instruct that takes image and text input and generates text output. | Image-to-Text, Transformers, Multilingual | 128 | 0 |
| xGen-MM Phi3 Mini Base R v1.5 | Salesforce | Apache-2.0 | xGen-MM is a series of large multimodal models (LMMs) from Salesforce AI Research that builds on the BLIP series with enhanced features and stronger foundational capabilities. | Text-to-Image, Safetensors, English | 830 | 21 |
| ViP-LLaVA 7B | mucai | n/a | ViP-LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on image- and region-level instruction data. | Text-to-Image, Transformers | 66.75k | 8 |
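
Most of the instruction-tuned vision-language checkpoints above are queried with the same image-plus-text prompt pattern. Below is a minimal sketch of that pattern for the Qwen2.5-VL family (the full-precision upstream of the GGUF entry in the table), assuming a recent `transformers` release with Qwen2.5-VL support; the model ID, image path, and question are illustrative placeholders, not values taken from the listing.

```python
# Minimal sketch: asking a Qwen2.5-VL instruct checkpoint a question about an image.
# Assumes transformers >= 4.49 (Qwen2.5-VL support); smaller Qwen2.5-VL-*-Instruct
# checkpoints load the same way if the 32B variant is too large for local hardware.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"  # illustrative model ID
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-style prompt with one image placeholder and one text question.
image = Image.open("example.jpg")  # illustrative image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this picture?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The GGUF conversion in the table targets llama.cpp-style runtimes instead, so it is run through that toolchain rather than through `transformers`; the sketch above only covers the original safetensors checkpoints.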