# Multimodal Interaction
Moondream 2B (2025-04-14, 4-bit)
Apache-2.0
Moondream is a lightweight vision-language model designed for efficient cross-platform deployment. The 4-bit quantized version, released on April 14, 2025, significantly reduces memory usage while maintaining high accuracy.
Image-to-Text
Safetensors
moondream · 6,037 downloads · 38 likes

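To make the 4-bit claim concrete, here is a minimal sketch of loading a model with NF4 4-bit weights via transformers and bitsandbytes. The repo id below is an assumption for illustration, and the official 4-bit release may already ship pre-quantized weights, in which case the quantization config is unnecessary.

```python
# Minimal sketch: generic 4-bit NF4 loading with transformers + bitsandbytes.
# The repo id is an assumption; Moondream historically required trust_remote_code.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights to cut memory
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",      # hypothetical repo id for illustration
    quantization_config=quant,
    trust_remote_code=True,     # Moondream ships custom modeling code
    device_map="auto",
)
```
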
AgentCPM-GUI
Apache-2.0
AgentCPM-GUI is an on-device graphical user interface (GUI) agent with RFT-enhanced reasoning, capable of operating Chinese and English applications, built on the 8-billion-parameter MiniCPM-V.
Image-to-Text
Safetensors · Supports Multiple Languages
openbmb · 541 downloads · 94 likes

UI-TARS-1.5-7B (4-bit)
Apache-2.0
UI-TARS-1.5-7B-4bit is a multimodal model focused on image-text-to-text tasks, supporting English.
Image-to-Text
Transformers · Supports Multiple Languages
mlx-community · 184 downloads · 1 like

Gemma 3 12B IT QAT (3-bit)
Other
An MLX-format conversion of Google's Gemma 3 12B instruction-tuned model, supporting image-text-to-text tasks.
Image-to-Text
Transformers · Other
mlx-community · 65 downloads · 1 like

VideoChat-R1-thinking_7B
Apache-2.0
VideoChat-R1-thinking_7B is a multimodal model based on Qwen2.5-VL-7B-Instruct, focusing on video-text-to-text tasks.
Video-to-Text
Transformers · English
OpenGVLab · 800 downloads · 0 likes

JarvisVLA-Qwen2-VL-7B
MIT
A vision-language-action model designed specifically for Minecraft, capable of executing thousands of in-game skills from human language commands.
Image-to-Text
Transformers · English
CraftJarvis · 163 downloads · 8 likes

Qwen2.5-VL-3B-UI-R1
MIT
UI-R1 is a vision-language model enhanced by reinforcement learning for GUI agent action prediction, built on Qwen2.5-VL-3B-Instruct.
Image-to-Text · English
LZXzju · 96 downloads · 6 likes

Vamba-Qwen2-VL-7B
MIT
Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long-video understanding through cross-attention layers and Mamba-2 modules.
Video-to-Text
Transformers
TIGER-Lab · 806 downloads · 16 likes

SmolVLM2-500M-Video-Instruct (MLX)
Apache-2.0
A video-text-to-text model in MLX format, developed by HuggingFaceTB, supporting English.
Image-to-Text
Transformers · English
mlx-community · 2,491 downloads · 12 likes

Ultravox v0.5 (Llama 3.1 8B)
MIT
Ultravox is a multimodal speech large language model built on Llama-3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
Audio-Text-to-Text
Transformers · Supports Multiple Languages
fixie-ai · 17.86k downloads · 12 likes

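A minimal sketch of feeding speech plus chat turns to Ultravox through a trust_remote_code pipeline, following the usage documented for earlier Ultravox releases; the repo id and the input keys ('audio', 'turns', 'sampling_rate') are assumptions carried over from those model cards.

```python
# Sketch: speech + chat turns into an Ultravox pipeline.
# Repo id and input dict keys are assumptions based on earlier releases.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",  # assumed repo id
    trust_remote_code=True,
)

audio, sr = librosa.load("question.wav", sr=16000)  # model expects 16 kHz audio
turns = [{"role": "system", "content": "You are a helpful voice assistant."}]

out = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sr},
    max_new_tokens=64,
)
print(out)
```
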
Fluxi AI Small Vision
Apache-2.0
Fluxi AI is a multimodal assistant based on Qwen2-VL-7B-Instruct, capable of processing text, images, and video, with particular optimization for Portuguese language support.
Image-to-Text
Transformers · Other
JJhooww · 25 downloads · 2 likes

UGround-V1-2B
Apache-2.0
UGround is a powerful GUI visual grounding model trained with a simple recipe, developed jointly by the OSU NLP Group and Orby AI.
Multimodal Fusion
Transformers · English
osunlp · 975 downloads · 8 likes

SmolVLM-Instruct
Apache-2.0
A vision-language model fine-tuned from HuggingFaceTB/SmolVLM-Instruct, with training speed optimized using the Unsloth and TRL libraries.
Image-to-Text
Transformers · English
mjschock · 18 downloads · 2 likes

Dallah (LLaMA)
Dallah is an advanced multimodal large language model designed specifically for Arabic, focused on understanding and generating content across Arabic dialects.
Image-to-Text · Arabic
alielfilali01 · 17 downloads · 0 likes

SAM 2.1 Hiera Tiny
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting efficient segmentation through prompts.
Image Segmentation
facebook · 12.90k downloads · 9 likes

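All of the SAM 2 / SAM 2.1 entries in this list are driven the same way: a prompt (clicks, a box, or a mask) selects what to segment. Below is a minimal sketch using the sam2 package's image predictor, following the from_pretrained usage shown on the facebook model cards; the click coordinates are illustrative.

```python
# Sketch: prompted image segmentation with SAM 2.1.
# The point prompt values are illustrative, not model defaults.
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-tiny")

image = np.array(Image.open("photo.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # One foreground click (label 1) at pixel (500, 375) as the prompt.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
```
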
SAM 2.1 Hiera Small
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting efficient segmentation through prompts.
Image Segmentation
facebook · 7,333 downloads · 6 likes

SAM 2.1 Hiera Large
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting universal segmentation tasks through prompts.
Image Segmentation
facebook · 203.27k downloads · 81 likes

LLaVA-Video-7B-Qwen2
Apache-2.0
LLaVA-Video is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding and supporting up to 64 frames of video input.
Video-to-Text
Transformers · English
lmms-lab · 34.28k downloads · 91 likes

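The 64-frame limit above implies a frame-sampling step before inference. Here is a sketch of uniform sampling with OpenCV; only the frame count comes from this entry, and the model's own processor is assumed to handle resizing and normalization.

```python
# Sketch: uniformly sample N frames from a video for a frame-limited VLM.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 64) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

clip = sample_frames("demo.mp4")  # shape: (64, H, W, 3)
```
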
xGen-MM-Phi3-mini-instruct-interleave-r-v1.5
Apache-2.0
xGen-MM is a series of foundational large multimodal models (LMMs) developed by Salesforce AI Research, building on the design of the BLIP series with enhancements that yield a more robust model foundation.
Image-to-Text
Safetensors · English
Salesforce · 7,373 downloads · 51 likes

SAM 2 Hiera Small
Apache-2.0
A foundational model developed by FAIR for promptable visual segmentation in images and videos.
Image Segmentation
facebook · 12.98k downloads · 13 likes

SAM 2 Hiera Tiny
Apache-2.0
SAM 2 is a foundational model for promptable visual segmentation in images and videos, developed by FAIR, supporting efficient segmentation through prompts.
Image Segmentation
facebook · 41.88k downloads · 20 likes

SAM 2 Hiera Large
Apache-2.0
A foundational model for promptable visual segmentation in images and videos, developed by FAIR.
Image Segmentation
facebook · 155.85k downloads · 68 likes

UGround
UGround is a powerful GUI visual grounding model trained with a streamlined recipe, developed by the Ohio State University NLP Group in collaboration with Orby AI.
Image-to-Text
osunlp · 208 downloads · 23 likes

InternVideo2-Chat-8B
MIT
InternVideo2-Chat-8B is a video understanding model that combines a large language model (LLM) with a video BLIP module, built through a progressive learning scheme and capable of video semantic understanding and human-computer interaction.
Video-to-Text
Transformers · English
OpenGVLab · 492 downloads · 22 likes

LLaVA-MORE (LLaMA 3.1 8B fine-tuning)
Apache-2.0
LLaVA-MORE is an enhanced version of the LLaVA architecture that integrates LLaMA 3.1 as the language model, focusing on image-to-text tasks.
Image-to-Text
Transformers
aimagelab · 215 downloads · 9 likes

Poppy Porpoise 0.72 (L3 8B)
Other
An AI role-playing assistant based on the Llama 3 8B model, focused on creating immersive narrative experiences.
Large Language Model
Transformers
Nitral-AI · 41 downloads · 32 likes

Poppy Porpoise v0.7 (L3 8B)
Other
An AI role-playing assistant based on the Llama 3 8B model, focused on creating interactive narrative experiences.
Large Language Model
Transformers
Nitral-AI · 32 downloads · 47 likes

InstructBLIP Flan-T5-XL (8-bit/NF4)
MIT
InstructBLIP is a vision-instruction-tuned version of BLIP-2, combining visual and language processing to generate responses from images and textual instructions.
Image-to-Text
Transformers · English
benferns · 20 downloads · 0 likes

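The InstructBLIP entries here are quantized repacks of the upstream Salesforce checkpoints. A sketch of obtaining the same effect directly with transformers' built-in InstructBLIP classes and a bitsandbytes NF4 config; the upstream repo id is used for illustration.

```python
# Sketch: InstructBLIP inference with 4-bit NF4 quantization.
# Uses the upstream Salesforce checkpoint rather than the repacks listed here.
import torch
from PIL import Image
from transformers import (
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
    BitsAndBytesConfig,
)

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

repo = "Salesforce/instructblip-flan-t5-xl"
processor = InstructBlipProcessor.from_pretrained(repo)
model = InstructBlipForConditionalGeneration.from_pretrained(
    repo, quantization_config=quant, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(
    images=image, text="Describe this image.", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
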
InstructBLIP Flan-T5-XL (8-bit/NF4)
MIT
InstructBLIP is a vision-instruction-tuned model based on BLIP-2 that uses Flan-T5-XL as its language model, capable of generating descriptions from images and text instructions.
Image-to-Text
Transformers · English
Mediocreatmybest · 22 downloads · 0 likes

InstructBLIP Flan-T5-XXL (8-bit/NF4)
MIT
InstructBLIP is the vision-instruction-tuned version of BLIP-2, combining vision and language models to generate descriptions or answer questions from images and text instructions.
Image-to-Text
Transformers · English
Mediocreatmybest · 22 downloads · 1 like

IDEFICS-80B
Other
IDEFICS-80B is an 80-billion-parameter multimodal model capable of processing both image and text inputs to generate text outputs. It is an open-source reproduction of DeepMind's Flamingo model.
Image-to-Text
Transformers · English
HuggingFaceM4 · 70 downloads · 70 likes

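IDEFICS has native transformers support; below is a minimal sketch of interleaved image-text prompting with the Idefics classes, following the pattern shown on the HuggingFaceM4 model cards. The prompt format is an assumption based on those cards, and the 80B checkpoint needs multi-GPU memory, hence device_map="auto".

```python
# Sketch: interleaved image + text prompting with IDEFICS.
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

checkpoint = "HuggingFaceM4/idefics-80b"  # the entry above; a 9b variant also exists
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Each prompt is a list interleaving images (URLs or PIL images) with text.
prompts = [[
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
    "Question: What is in this picture? Answer:",
]]
inputs = processor(prompts, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
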