# Multimodal Dialogue

## VoRA-7B-Instruct

VoRA-7B-Instruct is a 7B-parameter vision-language model focused on image-text-to-text tasks.

Task: Image-to-Text · Library: Transformers · Author: Hon-Wong · Downloads: 154 · Likes: 12

## VoRA-7B-Base

VoRA-7B-Base is a 7B-parameter vision-language model that processes image and text inputs to generate text outputs.

Task: Image-to-Text · Library: Transformers · Author: Hon-Wong · Downloads: 62 · Likes: 4

## Qwen2.5-VL-7B-Instruct-Q4_K_M-GGUF

A GGUF-quantized (Q4_K_M) version of the Qwen2.5-VL-7B-Instruct model, suitable for multimodal tasks with both image and text inputs.

Task: Image-to-Text · Language: English · License: Apache-2.0 · Author: PatataAliena · Downloads: 69 · Likes: 1
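
The entry above is a GGUF checkpoint, so it targets llama.cpp-style runtimes rather than plain Transformers. Below is a minimal, hedged sketch using llama-cpp-python; the repo id and GGUF filename are assumptions (check the repository's file list), and image input would additionally require the model's vision projector (mmproj), which llama.cpp loads separately.

```python
# Minimal sketch: download the Q4_K_M GGUF and run it with llama-cpp-python.
# Repo id and filename are assumptions -- verify them against the actual repo.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="PatataAliena/Qwen2.5-VL-7B-Instruct-Q4_K_M-GGUF",  # assumed repo id
    filename="qwen2.5-vl-7b-instruct-q4_k_m.gguf",              # hypothetical filename
)

# Text-only generation; image input also needs the matching mmproj GGUF.
llm = Llama(model_path=gguf_path, n_ctx=4096)
out = llm("Briefly explain what a vision-language model does.", max_tokens=128)
print(out["choices"][0]["text"])
```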

## Q-SiT

Q-SiT Mini is a lightweight image quality assessment and dialogue model focused on image quality analysis and scoring.

Task: Image-to-Text · Library: Transformers · License: MIT · Author: zhangzicheng · Downloads: 79 · Likes: 0

## LLaVA-NeXT-Video-7B-hf

LLaVA-NeXT-Video-7B-hf is a video-based multimodal model that processes video and text inputs to generate text outputs.

Task: Video-to-Text · Language: English · Author: FriendliAI · Downloads: 30 · Likes: 0

## InternVL2_5-4B-AWQ

InternVL2_5-4B-AWQ is the AWQ-quantized version of InternVL2_5-4B produced with autoawq, supporting multilingual and multimodal tasks.

Task: Image-to-Text · Library: Transformers · Language: Other · License: MIT · Author: rootonchair · Downloads: 29 · Likes: 2
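
As a rough orientation only: an AWQ checkpoint like this one is typically loaded through Transformers with autoawq installed, using InternVL's remote modelling code. The sketch below assumes the repo id and that the checkpoint exposes the standard InternVL interface; it is not taken from this repository's card.

```python
# Hedged sketch: loading an AWQ-quantized InternVL checkpoint via Transformers.
# Assumes `autoawq` is installed and that the repo id below is correct.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "rootonchair/InternVL2_5-4B-AWQ"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # InternVL ships its own modelling code
    device_map="auto",
).eval()

# Inference then follows the InternVL model card: preprocess the image into
# pixel_values tiles and call model.chat(tokenizer, pixel_values, question, ...).
```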

## InternVL_2_5_HiCo_R64

A video multimodal large language model enhanced with Long and Rich Context (LRC) modeling, which improves on existing MLLMs by sharpening the perception of fine-grained details and capturing long-term temporal structure.

Task: Video-to-Text · Library: Transformers · Language: English · License: Apache-2.0 · Author: OpenGVLab · Downloads: 252 · Likes: 2

## InternLM-XComposer2.5-7B-Chat

InternLM-XComposer2.5-Chat is a dialogue model trained on top of InternLM-XComposer2.5-7B, with clear gains in multimodal instruction following and open-ended dialogue.

Task: Image-to-Text · Library: PyTorch · License: Other · Author: internlm · Downloads: 87 · Likes: 5

## QVQ-72B-Preview-Abliterated-GPTQ-Int8

An 8-bit GPTQ-quantized version of the QVQ-72B-Preview-abliterated model, supporting image-text-to-text tasks.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: huihui-ai · Downloads: 48 · Likes: 1

## Apollo-LMMs-Apollo-7B-t32

Apollo is a family of large multimodal models focused on video understanding; it handles video content up to an hour long and supports complex video QA and multi-turn dialogue.

Task: Video-to-Text · Library: Transformers · Language: English · License: Apache-2.0 · Author: GoodiesHere · Downloads: 67 · Likes: 55

## Apollo-LMMs-Apollo-1_5B-t32

Apollo is a family of large multimodal models focused on video understanding, excelling at long-video comprehension, temporal reasoning, and complex video question answering.

Task: Video-to-Text · License: Apache-2.0 · Author: GoodiesHere · Downloads: 37 · Likes: 10

## Mini-InternVL2-1B-DA-DriveLM

Mini-InternVL2-1B-DA-DriveLM is a multimodal model based on the Mini-InternVL architecture, fine-tuned through a domain adaptation framework for the DriveLM autonomous-driving domain, where it performs strongly on driving-scene understanding tasks.

Task: Image-to-Text · Library: Transformers · Language: Other · License: MIT · Author: OpenGVLab · Downloads: 61 · Likes: 1

## VARCO-VISION-14B-HF

VARCO-VISION-14B is a powerful English-Korean vision-language model that takes image and text input and generates text output, with grounding, referring, and OCR capabilities.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · Author: NCSOFT · Downloads: 449 · Likes: 24

## Aria-sequential_mlp-bnb_nf4

A BitsAndBytes NF4-quantized version of Aria-sequential_mlp, suitable for image-to-text tasks and requiring roughly 15.5 GB of VRAM.

Task: Image-to-Text · Library: Transformers · License: Apache-2.0 · Author: leon-se · Downloads: 76 · Likes: 11
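
The checkpoint above already ships NF4 weights, so it can be loaded directly. For readers who want to reproduce this kind of 4-bit NF4 setup on a full-precision model, the sketch below shows how BitsAndBytes NF4 loading is configured in Transformers; the model id is a placeholder, not this repository.

```python
# Illustrative sketch: configuring BitsAndBytes NF4 4-bit loading in Transformers.
# The model id is a placeholder -- the Aria repo above is already pre-quantized.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-full-precision-vlm",     # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```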

## mPLUG-Owl3-1B-241014

mPLUG-Owl3 is an advanced multimodal large language model aimed at long image-sequence understanding; its Hyper Attention mechanism substantially improves processing speed and the sequence lengths it can handle.

Task: Image-to-Text · Library: Safetensors · Language: English · License: Apache-2.0 · Author: mPLUG · Downloads: 617 · Likes: 2

## mPLUG-Owl3-2B-241014

mPLUG-Owl3 is an advanced multimodal large language model aimed at long image-sequence understanding; its Hyper Attention mechanism substantially improves processing speed and the sequence lengths it can handle.

Task: Image-to-Text · Language: English · License: Apache-2.0 · Author: mPLUG · Downloads: 2,680 · Likes: 6

## VideoChat2-HD-Stage4-Mistral-7B-hf

VideoChat2-HD-hf is a multimodal video understanding model based on Mistral-7B, focused on video-to-text tasks.

Task: Video-to-Text · Library: Safetensors · License: MIT · Author: OpenGVLab · Downloads: 393 · Likes: 3

## Qwen2-Audio-7B-Instruct-4bit

A 4-bit quantized version of Qwen2-Audio-7B-Instruct, an audio-text multimodal large language model built on Alibaba Cloud's original Qwen model.

Task: Audio-to-Text · Library: Transformers · Author: alicekyting · Downloads: 1,090 · Likes: 6
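
A hedged sketch of typical usage, following the upstream Qwen2-Audio-7B-Instruct pattern in Transformers. The 4-bit repo id is an assumption (substitute the actual repository name), and the keyword names follow the original Qwen2-Audio examples.

```python
# Minimal sketch following the upstream Qwen2-Audio usage pattern.
# The repo id is an assumption -- replace it with the actual 4-bit repository.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

repo = "alicekyting/Qwen2-Audio-7B-Instruct-4bit"  # assumed repo id
processor = AutoProcessor.from_pretrained(repo)
model = Qwen2AudioForConditionalGeneration.from_pretrained(repo, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "speech.wav"},
        {"type": "text", "text": "What is the speaker saying?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("speech.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
out_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```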

## InternVideo2-Chat-8B-InternLM2.5

InternVideo2-Chat-8B-InternLM2.5 is a video-text multimodal model that couples the InternVideo2 video encoder with a large language model (LLM) to strengthen video understanding and human-computer interaction.

Task: Video-to-Text · Library: Safetensors · License: MIT · Author: OpenGVLab · Downloads: 60 · Likes: 7

## mPLUG-Owl3-7B-240728

mPLUG-Owl3 is a cutting-edge multimodal large language model designed to tackle long image-sequence understanding, supporting single-image, multi-image, and video tasks.

Task: Image-to-Text · Library: Safetensors · Language: English · License: Apache-2.0 · Author: mPLUG · Downloads: 4,823 · Likes: 39

## Banban-Beta-v2-GGUF

BanBan is an AI virtual-anchor assistant built specifically for the NTNU VLSI club, capable of image-text-to-text conversation.

Task: Image-to-Text · Languages: Multiple · Author: asadfgglie · Downloads: 97 · Likes: 1

## LLaVA-Saiga-8b

LLaVA-Saiga-8b is a vision-language model (VLM) built on the IlyaGusev/saiga_llama3_8b model, optimized primarily for Russian-language tasks while retaining English capability.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: deepvk · Downloads: 205 · Likes: 16

## TinyLLaVA-1.1B-v0.1

A lightweight visual question answering model based on TinyLlama-1.1B and trained with the BakLLaVA codebase, supporting image understanding and question answering.

Task: Image-to-Text · Library: Transformers · License: Apache-2.0 · Author: TitanML · Downloads: 27 · Likes: 0

## llava-calm2-siglip

llava-calm2-siglip is an experimental vision-language model that can answer questions about images in Japanese and English.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: cyberagent · Downloads: 3,930 · Likes: 25

## PaliGemma-3B-Chat-v0.2

A multimodal dialogue model fine-tuned from google/paligemma-3b-mix-448 and optimized for multi-turn conversation.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · Author: BUAADreamer · Downloads: 80 · Likes: 9

## Vision-8B-MiniCPM-2_5-Uncensored-and-Detailed-4bit

An int4-quantized version of MiniCPM-Llama3-V 2.5 that sharply reduces GPU VRAM usage (to roughly 9 GB).

Task: Image-to-Text · Library: Transformers · Author: sdasd112132 · Downloads: 330 · Likes: 30

## CogVLM2-Llama3-Chat-19B-Int4

CogVLM2 is a multimodal dialogue model based on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, an 8K context length, and images up to 1344×1344 resolution.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: THUDM · Downloads: 467 · Likes: 28

## MiniCPM-Llama3-V-2_5-int4

The int4-quantized version of MiniCPM-Llama3-V 2.5; it cuts GPU VRAM usage to roughly 9 GB and is well suited to visual question answering tasks.

Task: Image-to-Text · Library: Transformers · Author: openbmb · Downloads: 17.97k · Likes: 73
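
A minimal usage sketch for this checkpoint, following the MiniCPM-Llama3-V 2.5 model-card style; the `.chat()` interface comes from the model's own remote code, so treat the exact arguments as an assumption and defer to the repository's README.

```python
# Hedged sketch: chatting with the int4 MiniCPM-Llama3-V 2.5 checkpoint.
# The .chat() API is defined by the model's remote code (trust_remote_code).
from PIL import Image
from transformers import AutoModel, AutoTokenizer

repo = "openbmb/MiniCPM-Llama3-V-2_5-int4"
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": "What is in this image?"}]

answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```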

## 360VL-70B

360VL is an open-source large multimodal model built on the Llama 3 language model, with strong image understanding and bilingual text support.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: qihoo360 · Downloads: 103 · Likes: 10

## CogVLM2-Llama3-Chinese-Chat-19B

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting both Chinese and English with strong image understanding and dialogue capabilities.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: THUDM · Downloads: 118 · Likes: 68

## CogVLM2-Llama3-Chat-19B

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting image understanding and dialogue with an 8K context length and images up to 1344×1344 resolution.

Task: Image-to-Text · Library: Transformers · Language: English · License: Other · Author: THUDM · Downloads: 7,805 · Likes: 212

## 360VL-8B

360VL is a multimodal model built on the Llama 3 language model, with strong image understanding and bilingual dialogue capabilities.

Task: Image-to-Text · Library: Transformers · Languages: Multiple · License: Apache-2.0 · Author: qihoo360 · Downloads: 22 · Likes: 13

## Libra-11B-Chat

A multimodal dialogue model created by instruction fine-tuning Libra-Base, capable of image understanding and text generation.

Task: Image-to-Text · Library: Transformers · License: Apache-2.0 · Author: YifanXu · Downloads: 18 · Likes: 0

## LLaVA-Llama-3-8B

A large multimodal model trained with the LLaVA-v1.5 framework, using the 8-billion-parameter Meta-Llama-3-8B-Instruct as its language backbone together with a CLIP-based vision encoder.

Task: Image-to-Text · Library: Transformers · License: Other · Author: Intel · Downloads: 387 · Likes: 14

## LLaVA-Llama-3-8B-v1_1-GGUF

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.

Task: Image-to-Text · Author: MoMonir · Downloads: 138 · Likes: 5

## LLaVA-Llama-3-8B-v1_1-GGUF

A multimodal model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image understanding and text generation.

Task: Image-to-Text · Author: xtuner · Downloads: 9,484 · Likes: 216

## LLaVA-Llama-3-8B-v1_1-Transformers

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks.

Task: Image-to-Text · Author: xtuner · Downloads: 454.61k · Likes: 78
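
Since this variant ships in the Transformers-native LLaVA layout, it can be driven through the standard image-to-text pipeline. The sketch below is a hedged example; the Llama-3-style prompt string follows the usual model-card pattern and may need adjusting to the processor's actual chat template.

```python
# Hedged sketch: running the Transformers-format LLaVA-Llama-3 checkpoint
# through the image-to-text pipeline.
from PIL import Image
from transformers import pipeline

pipe = pipeline("image-to-text", model="xtuner/llava-llama-3-8b-v1_1-transformers")

image = Image.open("example.jpg")  # any local test image
# Llama-3-style chat prompt with an <image> placeholder (assumed format).
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat does this image show?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 128})
print(outputs[0]["generated_text"])
```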

## LLaVA-Phi-3-mini-GGUF

LLaVA-Phi-3-mini is a LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.

Task: Image-to-Text · Author: xtuner · Downloads: 1,676 · Likes: 133

## LLaVA-Phi-3-mini-hf

A LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.

Task: Image-to-Text · Library: Transformers · Author: xtuner · Downloads: 2,322 · Likes: 49

## LLaVA-Llama-3-8B-v1_1-Q3_K_S-GGUF

A GGUF-format conversion of xtuner/llava-llama-3-8b-v1_1, supporting multimodal processing of image and text inputs.

Task: Image-to-Text · Author: djward888 · Downloads: 17 · Likes: 1