# Multimodal Interaction
Moondream 2B (2025-04-14, 4-bit)
Apache-2.0
Moondream is a lightweight vision-language model designed for efficient cross-platform deployment. The 4-bit quantized version, released on April 14, 2025, significantly reduces memory usage while maintaining high accuracy.
Image-to-Text
Safetensors
moondream · 6,037 downloads · 38 likes

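To make the 4-bit claim concrete, here is a minimal sketch of loading a model with NF4 4-bit weights via transformers and bitsandbytes. The repo id below is an assumption for illustration, and the official 4-bit release may already ship pre-quantized weights, in which case the quantization config is unnecessary.

```python
# Minimal sketch: generic 4-bit NF4 loading with transformers + bitsandbytes.
# The repo id is an assumption; Moondream historically required trust_remote_code.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to cut memory
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",      # hypothetical repo id for illustration
    quantization_config=quant,
    trust_remote_code=True,     # Moondream ships custom modeling code
    device_map="auto",
)
```
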
AgentCPM-GUI
Apache-2.0
AgentCPM-GUI is an on-device graphical user interface (GUI) agent with RFT-enhanced reasoning, capable of operating Chinese and English applications, built on the 8-billion-parameter MiniCPM-V.
Image-to-Text
Safetensors · Supports Multiple Languages
openbmb · 541 downloads · 94 likes

UI-TARS-1.5-7B (4-bit)
Apache-2.0
UI-TARS-1.5-7B-4bit is a multimodal model focused on image-text-to-text tasks, supporting English.
Image-to-Text
Transformers · Supports Multiple Languages
mlx-community · 184 downloads · 1 like

Gemma 3 12B IT QAT (3-bit)
Other
An MLX-format conversion of Google's Gemma 3 12B instruction-tuned model, supporting image-text-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 65 downloads · 1 like

VideoChat-R1-thinking_7B
Apache-2.0
VideoChat-R1-thinking_7B is a multimodal model based on Qwen2.5-VL-7B-Instruct, focusing on video-text-to-text tasks.
Video-to-Text
Transformers · English
OpenGVLab · 800 downloads · 0 likes

JarvisVLA-Qwen2-VL-7B
MIT
A vision-language-action model designed specifically for Minecraft, capable of executing thousands of in-game skills from human language commands.
Image-to-Text
Transformers · English
CraftJarvis · 163 downloads · 8 likes

Qwen2.5-VL-3B-UI-R1
MIT
UI-R1 is a vision-language model enhanced by reinforcement learning for GUI agent action prediction, built on Qwen2.5-VL-3B-Instruct.
Image-to-Text · English
LZXzju · 96 downloads · 6 likes

Vamba-Qwen2-VL-7B
MIT
Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long-video understanding through cross-attention layers and Mamba-2 modules.
Video-to-Text
Transformers
TIGER-Lab · 806 downloads · 16 likes

SmolVLM2-500M-Video-Instruct (MLX)
Apache-2.0
A video-text-to-text model in MLX format, developed by HuggingFaceTB, supporting English.
Image-to-Text
Transformers · English
mlx-community · 2,491 downloads · 12 likes

Ultravox v0.5 (Llama 3.1 8B)
MIT
Ultravox is a multimodal speech large language model built on Llama-3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
Audio-Text-to-Text
Transformers · Supports Multiple Languages
fixie-ai · 17.86k downloads · 12 likes

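A minimal sketch of feeding speech plus chat turns to Ultravox through a trust_remote_code pipeline, following the usage documented for earlier Ultravox releases; the repo id and the input keys ('audio', 'turns', 'sampling_rate') are assumptions carried over from those model cards.

```python
# Sketch: speech + chat turns into an Ultravox pipeline.
# Repo id and input dict keys are assumptions based on earlier releases.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",  # assumed repo id
    trust_remote_code=True,
)

audio, sr = librosa.load("question.wav", sr=16000)  # model expects 16 kHz audio
turns = [{"role": "system", "content": "You are a helpful voice assistant."}]

out = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sr},
    max_new_tokens=64,
)
print(out)
```
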
Fluxi AI Small Vision
Apache-2.0
Fluxi AI is a multimodal assistant based on Qwen2-VL-7B-Instruct, capable of processing text, images, and video, with particular optimization for Portuguese language support.
Image-to-Text
Transformers · Other
JJhooww · 25 downloads · 2 likes

UGround-V1-2B
Apache-2.0
UGround is a powerful GUI visual grounding model trained with a simple recipe, developed jointly by the OSU NLP Group and Orby AI.
Multimodal Fusion
Transformers · English
osunlp · 975 downloads · 8 likes

SmolVLM-Instruct
Apache-2.0
A vision-language model fine-tuned from HuggingFaceTB/SmolVLM-Instruct, with training speed optimized using the Unsloth and TRL libraries.
Image-to-Text
Transformers · English
mjschock · 18 downloads · 2 likes

Dallah (LLaMA)
Dallah is an advanced multimodal large language model designed specifically for Arabic, focused on understanding and generating content across Arabic dialects.
Image-to-Text · Arabic
alielfilali01 · 17 downloads · 0 likes

SAM 2.1 Hiera Tiny
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting efficient segmentation through prompts.
Image Segmentation
facebook · 12.90k downloads · 9 likes

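All of the SAM 2 / SAM 2.1 entries in this list are driven the same way: a prompt (clicks, a box, or a mask) selects what to segment. Below is a minimal sketch using the sam2 package's image predictor, following the from_pretrained usage shown on the facebook model cards; the click coordinates are illustrative.

```python
# Sketch: prompted image segmentation with SAM 2.1.
# The point prompt values are illustrative, not model defaults.
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-tiny")

image = np.array(Image.open("photo.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One foreground click (label 1) at pixel (500, 375) as the prompt.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```
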
SAM 2.1 Hiera Small
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting efficient segmentation through prompts.
Image Segmentation
facebook · 7,333 downloads · 6 likes

SAM 2.1 Hiera Large
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting universal segmentation tasks through prompts.
Image Segmentation
facebook · 203.27k downloads · 81 likes

LLaVA-Video-7B-Qwen2
Apache-2.0
LLaVA-Video is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding and supporting up to 64 frames of video input.
Video-to-Text
Transformers · English
lmms-lab · 34.28k downloads · 91 likes

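The 64-frame limit above implies a frame-sampling step before inference. Here is a sketch of uniform sampling with OpenCV; only the frame count comes from this entry, and the model's own processor is assumed to handle resizing and normalization.

```python
# Sketch: uniformly sample N frames from a video for a frame-limited VLM.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 64) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

clip = sample_frames("demo.mp4")  # shape: (64, H, W, 3)
```
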
xGen-MM-Phi3-mini-instruct-interleave-r-v1.5
Apache-2.0
xGen-MM is a series of foundational large multimodal models (LMMs) developed by Salesforce AI Research, building on the design of the BLIP series with enhancements that yield a more robust model foundation.
Image-to-Text
Safetensors · English
Salesforce · 7,373 downloads · 51 likes

SAM 2 Hiera Small
Apache-2.0
A foundational model developed by FAIR for promptable visual segmentation in images and videos.
Image Segmentation
facebook · 12.98k downloads · 13 likes

SAM 2 Hiera Tiny
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting efficient segmentation through prompts.
Image Segmentation
facebook · 41.88k downloads · 20 likes

SAM 2 Hiera Large
Apache-2.0
A foundational model for promptable visual segmentation in images and videos, developed by FAIR.
Image Segmentation
facebook · 155.85k downloads · 68 likes

UGround
UGround is a powerful GUI visual grounding model trained with a streamlined recipe, developed by the Ohio State University NLP Group in collaboration with Orby AI.
Image-to-Text
osunlp · 208 downloads · 23 likes

InternVideo2-Chat-8B
MIT
InternVideo2-Chat-8B is a video understanding model that combines a large language model (LLM) with a video BLIP module, built through a progressive learning scheme and capable of video semantic understanding and human-computer interaction.
Video-to-Text
Transformers · English
OpenGVLab · 492 downloads · 22 likes

LLaVA-MORE (LLaMA 3.1 8B fine-tuning)
Apache-2.0
LLaVA-MORE is an enhanced version of the LLaVA architecture that integrates LLaMA 3.1 as the language model, focusing on image-to-text tasks.
Image-to-Text
Transformers
aimagelab · 215 downloads · 9 likes

Poppy Porpoise 0.72 (L3 8B)
Other
An AI role-playing assistant based on the Llama 3 8B model, focused on creating immersive narrative experiences.
Large Language Model
Transformers
Nitral-AI · 41 downloads · 32 likes

Poppy Porpoise v0.7 (L3 8B)
Other
An AI role-playing assistant based on the Llama 3 8B model, focused on creating interactive narrative experiences.
Large Language Model
Transformers
Nitral-AI · 32 downloads · 47 likes

InstructBLIP Flan-T5-XL (8-bit/NF4)
MIT
InstructBLIP is a vision-instruction-tuned version of BLIP-2, combining visual and language processing to generate responses from images and textual instructions.
Image-to-Text
Transformers · English
benferns · 20 downloads · 0 likes

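The InstructBLIP entries here are quantized repacks of the upstream Salesforce checkpoints. A sketch of obtaining the same effect directly with transformers' built-in InstructBLIP classes and a bitsandbytes NF4 config; the upstream repo id is used for illustration.

```python
# Sketch: InstructBLIP inference with 4-bit NF4 quantization.
# Uses the upstream Salesforce checkpoint rather than the repacks listed here.
import torch
from PIL import Image
from transformers import (
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
    BitsAndBytesConfig,
)

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

repo = "Salesforce/instructblip-flan-t5-xl"
processor = InstructBlipProcessor.from_pretrained(repo)
model = InstructBlipForConditionalGeneration.from_pretrained(
    repo, quantization_config=quant, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(
    images=image, text="Describe this image.", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
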
InstructBLIP Flan-T5-XL (8-bit/NF4)
MIT
InstructBLIP is a vision-instruction-tuned model based on BLIP-2 that uses Flan-T5-XL as its language model, capable of generating descriptions from images and text instructions.
Image-to-Text
Transformers · English
Mediocreatmybest · 22 downloads · 0 likes

InstructBLIP Flan-T5-XXL (8-bit/NF4)
MIT
InstructBLIP is the vision-instruction-tuned version of BLIP-2, combining vision and language models to generate descriptions or answer questions from images and text instructions.
Image-to-Text
Transformers · English
Mediocreatmybest · 22 downloads · 1 like

IDEFICS-80B
Other
IDEFICS-80B is an 80-billion-parameter multimodal model capable of processing both image and text inputs to generate text outputs. It is an open-source reproduction of DeepMind's Flamingo model.
Image-to-Text
Transformers · English
HuggingFaceM4 · 70 downloads · 70 likes

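IDEFICS has native transformers support; below is a minimal sketch of interleaved image-text prompting with the Idefics classes, following the pattern shown on the HuggingFaceM4 model cards. The prompt format is an assumption based on those cards, and the 80B checkpoint needs multi-GPU memory, hence device_map="auto".

```python
# Sketch: interleaved image + text prompting with IDEFICS.
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

checkpoint = "HuggingFaceM4/idefics-80b"  # the entry above; a 9b variant also exists
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Each prompt is a list interleaving images (URLs or PIL images) with text.
prompts = [[
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
    "Question: What is in this picture? Answer:",
]]
inputs = processor(prompts, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
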