# Multimodal Learning

| Model | License | Publisher | Tags | Downloads | Likes | Description |
|---|---|---|---|---|---|---|
| Openvision Vit So400m Patch14 384 | Apache-2.0 | UCSC-VLAA | Multimodal Fusion | 238 | 0 | OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning. |
| Openvision Vit Base Patch16 160 | Apache-2.0 | UCSC-VLAA | Multimodal Fusion | 15 | 0 | OpenVision is a fully open-source, cost-effective family of advanced vision encoders for multimodal learning. |
| Openvision Vit Small Patch8 384 | Apache-2.0 | UCSC-VLAA | Multimodal Fusion | 21 | 0 | OpenVision is a fully open, cost-effective family of advanced vision encoders focused on multimodal learning. |
| Openvision Vit Small Patch16 224 | Apache-2.0 | UCSC-VLAA | Image Enhancement | 17 | 0 | OpenVision is a fully open, cost-effective family of advanced vision encoders focused on multimodal learning. |
| Med Dis B | N/A | therarelab | Video Processing | 14 | 0 | A PyTorch-based action recognition model for robotics applications. |
| Wedgit Stack Single Fixed | N/A | jclinton1 | Multimodal Fusion | 76 | 0 | A robot control model based on diffusion policy, released via the PyTorchModelHubMixin integration. |
| Genmedclip B 16 PMB | MIT | wisdomik | Image Classification | 408 | 0 | A zero-shot image classification model based on the open_clip library, specializing in medical image analysis. |
| Genmedclip | MIT | wisdomik | Image Classification | 40 | 0 | GenMedClip is a zero-shot image classification model based on the open_clip library, specializing in medical image analysis. |
| Moe LLaVA Qwen 1.8B 4e | Apache-2.0 | LanguageBind | Text-to-Image, Transformers | 176 | 14 | MoE-LLaVA is a large vision-language model built on a Mixture-of-Experts architecture, achieving efficient multimodal learning through sparsely activated parameters. |
| Echo Clip R | MIT | mkaichristensen | Image Classification | 547 | 4 | A zero-shot image classification model based on the open_clip library, supporting a variety of vision tasks. |
| Git 20 | MIT | uf-aice-lab | Image-to-Text, Transformers, Supports Multiple Languages | 18 | 1 | A multimodal model based on Microsoft's GIT framework, focused on extracting text from student homework images and generating teacher feedback. |
| Git Base Textvqa | MIT | Hellraiser24 | Large Language Model, Transformers, Other | 19 | 0 | A visual question answering model based on microsoft/git-base-textvqa and fine-tuned on the TextVQA dataset, excelling at question answering over text in images. |
| Dof Passport 1 | MIT | Sebabrata | Image-to-Text, Transformers | 16 | 0 | A model fine-tuned from naver-clova-ix/donut-base; its specific purpose is not explicitly stated. |
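Several entries above (GenMedClip, Echo Clip R) are CLIP-style zero-shot classifiers: the image and each candidate label prompt are embedded, and the label whose embedding is most cosine-similar to the image embedding wins. A minimal sketch of that scoring step, using toy NumPy vectors in place of real open_clip encoder outputs (all embeddings and prompt names here are illustrative assumptions, not the models' actual weights or prompts):

```python
import numpy as np

def normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Return class probabilities for one image embedding against N label embeddings."""
    image_emb = normalize(image_emb)
    text_embs = normalize(text_embs)
    logits = temperature * text_embs @ image_emb  # cosine similarities, scaled
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

# Toy 4-d embeddings standing in for real encoder outputs.
image = np.array([1.0, 0.0, 0.0, 0.1])
labels = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "a chest X-ray" (hypothetical prompt)
    [0.0, 1.0, 0.0, 0.0],   # e.g. "an echocardiogram" (hypothetical prompt)
])
probs = zero_shot_classify(image, labels)
print(probs.argmax())  # 0 -> the first prompt matches best
```

In the real models, `image_emb` and `text_embs` would come from the vision and text towers of an open_clip checkpoint; the temperature plays the role of CLIP's learned logit scale.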
© 2025 AIbase