# Multimodal Dialogue

## VoRA-7B-Instruct

VoRA-7B-Instruct is a 7B-parameter vision-language model focused on image-text-to-text tasks.

Task: Image-to-Text · Library: Transformers · Author: Hon-Wong · Downloads: 154 · Likes: 12

## VoRA-7B-Base

VoRA-7B-Base is a 7B-parameter vision-language model that processes image and text inputs to generate text outputs.

Task: Image-to-Text · Library: Transformers · Author: Hon-Wong · Downloads: 62 · Likes: 4

## Qwen2.5-VL-7B-Instruct-Q4_K_M-GGUF

A GGUF-quantized (Q4_K_M) version of the Qwen2.5-VL-7B-Instruct model, suitable for multimodal tasks with both image and text inputs.

Task: Image-to-Text · Language: English · License: Apache-2.0 · Author: PatataAliena · Downloads: 69 · Likes: 1
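
The entry above is a GGUF checkpoint, so it targets llama.cpp-style runtimes rather than plain Transformers. Below is a minimal, hedged sketch using llama-cpp-python; the repo id and GGUF filename are assumptions (check the repository's file list), and image input would additionally require the model's vision projector (mmproj), which llama.cpp loads separately.

```python
# Minimal sketch: download the Q4_K_M GGUF and run it with llama-cpp-python.
# Repo id and filename are assumptions -- verify them against the actual repo.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="PatataAliena/Qwen2.5-VL-7B-Instruct-Q4_K_M-GGUF",  # assumed repo id
    filename="qwen2.5-vl-7b-instruct-q4_k_m.gguf",              # hypothetical filename
)

# Text-only generation; image input also needs the matching mmproj GGUF.
llm = Llama(model_path=gguf_path, n_ctx=4096)
out = llm("Briefly explain what a vision-language model does.", max_tokens=128)
print(out["choices"][0]["text"])
```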

## Q-SiT

Q-SiT Mini is a lightweight image quality assessment and dialogue model focused on image quality analysis and scoring.

Task: Image-to-Text · Library: Transformers · License: MIT · Author: zhangzicheng · Downloads: 79 · Likes: 0

## LLaVA-NeXT-Video-7B-hf

LLaVA-NeXT-Video-7B-hf is a video-based multimodal model that processes video and text inputs to generate text outputs.

Task: Video-to-Text · Language: English · Author: FriendliAI · Downloads: 30 · Likes: 0

## InternVL2_5-4B-AWQ

InternVL2_5-4B-AWQ is the AWQ-quantized version of InternVL2_5-4B produced with autoawq, supporting multilingual and multimodal tasks.

Task: Image-to-Text · Library: Transformers · Language: Other · License: MIT · Author: rootonchair · Downloads: 29 · Likes: 2
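
As a rough orientation only: an AWQ checkpoint like this one is typically loaded through Transformers with autoawq installed, using InternVL's remote modelling code. The sketch below assumes the repo id and that the checkpoint exposes the standard InternVL interface; it is not taken from this repository's card.

```python
# Hedged sketch: loading an AWQ-quantized InternVL checkpoint via Transformers.
# Assumes `autoawq` is installed and that the repo id below is correct.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "rootonchair/InternVL2_5-4B-AWQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # InternVL ships its own modelling code
    device_map="auto",
).eval()

# Inference then follows the InternVL model card: preprocess the image into
# pixel_values tiles and call model.chat(tokenizer, pixel_values, question, ...).
```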

## InternVL_2_5_HiCo_R64

A video multimodal large language model enhanced with Long and Rich Context (LRC) modeling, which improves on existing MLLMs by sharpening the perception of fine-grained details and capturing long-term temporal structure.

Task: Video-to-Text · Library: Transformers · Language: English · License: Apache-2.0 · Author: OpenGVLab · Downloads: 252 · Likes: 2

## InternLM-XComposer2.5-7B-Chat

InternLM-XComposer2.5-Chat is a dialogue model trained on top of InternLM-XComposer2.5-7B, with clear gains in multimodal instruction following and open-ended dialogue.

Task: Image-to-Text · Library: PyTorch · License: Other · Author: internlm · Downloads: 87 · Likes: 5

## QVQ-72B-Preview-Abliterated-GPTQ-Int8

An 8-bit GPTQ-quantized version of the QVQ-72B-Preview-abliterated model, supporting image-text-to-text tasks.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: huihui-ai · Downloads: 48 · Likes: 1

## Apollo-LMMs-Apollo-7B-t32

Apollo is a family of large multimodal models focused on video understanding; it handles video content up to an hour long and supports complex video QA and multi-turn dialogue.

Task: Video-to-Text · Library: Transformers · Language: English · License: Apache-2.0 · Author: GoodiesHere · Downloads: 67 · Likes: 55

## Apollo-LMMs-Apollo-1_5B-t32

Apollo is a family of large multimodal models focused on video understanding, excelling at long-video comprehension, temporal reasoning, and complex video question answering.

Task: Video-to-Text · License: Apache-2.0 · Author: GoodiesHere · Downloads: 37 · Likes: 10

## Mini-InternVL2-1B-DA-DriveLM

Mini-InternVL2-1B-DA-DriveLM is a multimodal model based on the Mini-InternVL architecture, fine-tuned through a domain adaptation framework for the DriveLM autonomous-driving domain, where it performs strongly on driving-scene understanding tasks.

Task: Image-to-Text · Library: Transformers · Language: Other · License: MIT · Author: OpenGVLab · Downloads: 61 · Likes: 1

## VARCO-VISION-14B-HF

VARCO-VISION-14B is a powerful English-Korean vision-language model that takes image and text input and generates text output, with grounding, referring, and OCR capabilities.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · Author: NCSOFT · Downloads: 449 · Likes: 24

## Aria-sequential_mlp-bnb_nf4

A BitsAndBytes NF4-quantized version of Aria-sequential_mlp, suitable for image-to-text tasks and requiring roughly 15.5 GB of VRAM.

Task: Image-to-Text · Library: Transformers · License: Apache-2.0 · Author: leon-se · Downloads: 76 · Likes: 11
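
The checkpoint above already ships NF4 weights, so it can be loaded directly. For readers who want to reproduce this kind of 4-bit NF4 setup on a full-precision model, the sketch below shows how BitsAndBytes NF4 loading is configured in Transformers; the model id is a placeholder, not this repository.

```python
# Illustrative sketch: configuring BitsAndBytes NF4 4-bit loading in Transformers.
# The model id is a placeholder -- the Aria repo above is already pre-quantized.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-full-precision-vlm",     # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```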

## mPLUG-Owl3-1B-241014

mPLUG-Owl3 is an advanced multimodal large language model aimed at long image-sequence understanding; its Hyper Attention mechanism substantially improves processing speed and the sequence lengths it can handle.

Task: Image-to-Text · Library: Safetensors · Language: English · License: Apache-2.0 · Author: mPLUG · Downloads: 617 · Likes: 2

## mPLUG-Owl3-2B-241014

mPLUG-Owl3 is an advanced multimodal large language model aimed at long image-sequence understanding; its Hyper Attention mechanism substantially improves processing speed and the sequence lengths it can handle.

Task: Image-to-Text · Language: English · License: Apache-2.0 · Author: mPLUG · Downloads: 2,680 · Likes: 6

## VideoChat2-HD-Stage4-Mistral-7B-hf

VideoChat2-HD-hf is a multimodal video understanding model based on Mistral-7B, focused on video-to-text tasks.

Task: Video-to-Text · Library: Safetensors · License: MIT · Author: OpenGVLab · Downloads: 393 · Likes: 3

## Qwen2-Audio-7B-Instruct-4bit

A 4-bit quantized version of Qwen2-Audio-7B-Instruct, an audio-text multimodal large language model built on Alibaba Cloud's original Qwen model.

Task: Audio-to-Text · Library: Transformers · Author: alicekyting · Downloads: 1,090 · Likes: 6
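
A hedged sketch of typical usage, following the upstream Qwen2-Audio-7B-Instruct pattern in Transformers. The 4-bit repo id is an assumption (substitute the actual repository name), and the keyword names follow the original Qwen2-Audio examples.

```python
# Minimal sketch following the upstream Qwen2-Audio usage pattern.
# The repo id is an assumption -- replace it with the actual 4-bit repository.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

repo = "alicekyting/Qwen2-Audio-7B-Instruct-4bit"  # assumed repo id
processor = AutoProcessor.from_pretrained(repo)
model = Qwen2AudioForConditionalGeneration.from_pretrained(repo, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "speech.wav"},
        {"type": "text", "text": "What is the speaker saying?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("speech.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
out_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```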

## InternVideo2-Chat-8B-InternLM2.5

InternVideo2-Chat-8B-InternLM2.5 is a video-text multimodal model that couples the InternVideo2 video encoder with a large language model (LLM) to strengthen video understanding and human-computer interaction.

Task: Video-to-Text · Library: Safetensors · License: MIT · Author: OpenGVLab · Downloads: 60 · Likes: 7

## mPLUG-Owl3-7B-240728

mPLUG-Owl3 is a cutting-edge multimodal large language model designed to tackle long image-sequence understanding, supporting single-image, multi-image, and video tasks.

Task: Image-to-Text · Library: Safetensors · Language: English · License: Apache-2.0 · Author: mPLUG · Downloads: 4,823 · Likes: 39

## Banban-Beta-v2-GGUF

BanBan is an AI virtual-anchor assistant built specifically for the NTNU VLSI club, capable of image-text-to-text conversation.

Task: Image-to-Text · Languages: Multiple · Author: asadfgglie · Downloads: 97 · Likes: 1

## LLaVA-Saiga-8b

LLaVA-Saiga-8b is a vision-language model (VLM) built on the IlyaGusev/saiga_llama3_8b model, optimized primarily for Russian-language tasks while retaining English capability.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: deepvk · Downloads: 205 · Likes: 16

## TinyLLaVA-1.1B-v0.1

A lightweight visual question answering model based on TinyLlama-1.1B and trained with the BakLLaVA codebase, supporting image understanding and question answering.

Task: Image-to-Text · Library: Transformers · License: Apache-2.0 · Author: TitanML · Downloads: 27 · Likes: 0

## llava-calm2-siglip

llava-calm2-siglip is an experimental vision-language model that can answer questions about images in Japanese and English.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: cyberagent · Downloads: 3,930 · Likes: 25

## PaliGemma-3B-Chat-v0.2

A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · Author: BUAADreamer · Downloads: 80 · Likes: 9

## Vision-8B-MiniCPM-2_5-Uncensored-and-Detailed-4bit

An int4-quantized version of MiniCPM-Llama3-V 2.5 that sharply reduces GPU VRAM usage (to roughly 9 GB).

Task: Image-to-Text · Library: Transformers · Author: sdasd112132 · Downloads: 330 · Likes: 30

## CogVLM2-Llama3-Chat-19B-Int4

CogVLM2 is a multimodal dialogue model based on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, an 8K context length, and images up to 1344×1344 resolution.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: THUDM · Downloads: 467 · Likes: 28

## MiniCPM-Llama3-V-2_5-int4

The int4-quantized version of MiniCPM-Llama3-V 2.5; it cuts GPU VRAM usage to roughly 9 GB and is well suited to visual question answering tasks.

Task: Image-to-Text · Library: Transformers · Author: openbmb · Downloads: 17.97k · Likes: 73
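
A minimal usage sketch for this checkpoint, following the MiniCPM-Llama3-V 2.5 model-card style; the `.chat()` interface comes from the model's own remote code, so treat the exact arguments as an assumption and defer to the repository's README.

```python
# Hedged sketch: chatting with the int4 MiniCPM-Llama3-V 2.5 checkpoint.
# The .chat() API is defined by the model's remote code (trust_remote_code).
from PIL import Image
from transformers import AutoModel, AutoTokenizer

repo = "openbmb/MiniCPM-Llama3-V-2_5-int4"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": "What is in this image?"}]

answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```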

## 360VL-70B

360VL is an open-source large multimodal model built on the Llama 3 language model, with strong image understanding and bilingual text support.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: qihoo360 · Downloads: 103 · Likes: 10

## CogVLM2-Llama3-Chinese-Chat-19B

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting both Chinese and English with strong image understanding and dialogue capabilities.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: THUDM · Downloads: 118 · Likes: 68

## CogVLM2-Llama3-Chat-19B

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting image understanding and dialogue with an 8K context length and images up to 1344×1344 resolution.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: THUDM · Downloads: 7,805 · Likes: 212

## 360VL-8B

360VL is a multimodal model built on the Llama 3 language model, with strong image understanding and bilingual dialogue capabilities.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: qihoo360 · Downloads: 22 · Likes: 13

## Libra-11B-Chat

A multimodal dialogue model created by instruction fine-tuning Libra-Base, capable of image understanding and text generation.

Task: Image-to-Text · Library: Transformers · License: Apache-2.0 · Author: YifanXu · Downloads: 18 · Likes: 0

## LLaVA-Llama-3-8B

A large multimodal model trained with the LLaVA-v1.5 framework, using the 8-billion-parameter Meta-Llama-3-8B-Instruct as its language backbone together with a CLIP-based vision encoder.

Task: Image-to-Text · Library: Transformers · License: Other · Author: Intel · Downloads: 387 · Likes: 14

## LLaVA-Llama-3-8B-v1_1-GGUF

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.

Task: Image-to-Text · Author: MoMonir · Downloads: 138 · Likes: 5

## LLaVA-Llama-3-8B-v1_1-GGUF

A multimodal model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image understanding and text generation.

Task: Image-to-Text · Author: xtuner · Downloads: 9,484 · Likes: 216

## LLaVA-Llama-3-8B-v1_1-Transformers

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks.

Task: Image-to-Text · Author: xtuner · Downloads: 454.61k · Likes: 78
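
Since this variant ships in the Transformers-native LLaVA layout, it can be driven through the standard image-to-text pipeline. The sketch below is a hedged example; the Llama-3-style prompt string follows the usual model-card pattern and may need adjusting to the processor's actual chat template.

```python
# Hedged sketch: running the Transformers-format LLaVA-Llama-3 checkpoint
# through the image-to-text pipeline.
from PIL import Image
from transformers import pipeline

pipe = pipeline("image-to-text", model="xtuner/llava-llama-3-8b-v1_1-transformers")

image = Image.open("example.jpg")  # any local test image
# Llama-3-style chat prompt with an <image> placeholder (assumed format).
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat does this image show?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 128})
print(outputs[0]["generated_text"])
```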

## LLaVA-Phi-3-mini-GGUF

LLaVA-Phi-3-mini is a LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.

Task: Image-to-Text · Author: xtuner · Downloads: 1,676 · Likes: 133

## LLaVA-Phi-3-mini-hf

A LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.

Task: Image-to-Text · Library: Transformers · Author: xtuner · Downloads: 2,322 · Likes: 49

## LLaVA-Llama-3-8B-v1_1-Q3_K_S-GGUF

A GGUF-format conversion of xtuner/llava-llama-3-8b-v1_1, supporting multimodal processing of image and text inputs.

Task: Image-to-Text · Author: djward888 · Downloads: 17 · Likes: 1