# Multimodal large language model
InternVL3 8B HF
Other
InternVL3 is a multimodal large language model series for multimodal perception and reasoning that accepts image, video, and text inputs.
Image-to-Text
Transformers Other

OpenGVLab
454
1
Minimax VL 01
MiniMax-VL-01 is a multimodal large language model that adopts the 'ViT-MLP-LLM' framework with dynamic-resolution processing for vision-language tasks.
Image-to-Text
MiniMaxAI
237
253
Llava UHD V2 Vicuna 7B
LLaVA-UHD v2 is a multimodal large language model built around a hierarchical window transformer that captures different visual granularities through a high-resolution feature pyramid.
Multimodal Fusion
Transformers

YipengZhang
103
6
Auroracap 7B VID Xtuner
Apache-2.0
AuroraCap is a multimodal large language model for image and video captioning, with a focus on efficient, detailed video caption generation.
Video-to-Text
wchai
31
5
Featured Recommended AI Models