MoE-LLaVA-Qwen-1.8B-4e

Developed by LanguageBind
MoE-LLaVA is a large vision-language model built on a Mixture-of-Experts (MoE) architecture. It achieves efficient multimodal learning by activating only a sparse subset of its parameters for each input token.
Downloads: 176
Released: 1/23/2024

Model Overview

MoE-LLaVA combines visual and language understanding in a single model. Its Mixture-of-Experts architecture routes each token to a small subset of expert networks, so only a fraction of the parameters is active at any time; this keeps multimodal interaction efficient while maintaining high performance, as the sketch below illustrates.
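
To make the sparse-activation idea concrete, here is a minimal, self-contained sketch of a top-k routed MoE feed-forward layer in PyTorch. The 4-expert, top-2 configuration mirrors the model's public description (the "4e" in its name denotes 4 experts), but the layer itself is a generic illustration, not MoE-LLaVA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Sketch of a sparse MoE feed-forward layer: a router picks the top-k
    experts per token, so only a fraction of the parameters is active."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                              # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 256)                 # 16 tokens, hidden size 256
layer = SparseMoELayer(d_model=256, d_ff=1024)
print(layer(tokens).shape)                    # torch.Size([16, 256])
```

Because each token passes through only top_k of the num_experts expert FFNs, per-token compute scales with the number of active experts rather than with the total parameter count.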

Model Features

Efficient Parameter Utilization
Achieves performance comparable to 7B dense models while activating only about 3 billion parameters per token (see the accounting sketch after this list)
Rapid Training
Trains in under 2 days on 8 V100 GPUs
Outstanding Performance
Surpasses larger models on multiple visual understanding tasks
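
The gap between stored and active parameters is simple arithmetic: the dense backbone (attention, embeddings, etc.) always runs, while only top-k of the expert FFNs do. The figures below are illustrative placeholders, not Qwen-1.8B's real layer breakdown.

```python
# Back-of-the-envelope sparse-activation accounting.
# These parameter counts are ILLUSTRATIVE PLACEHOLDERS, not the real layout.
dense_backbone = 1.0e9      # always-active parameters (attention, embeddings, ...)
ffn_per_expert = 0.5e9      # parameters in one expert FFN
num_experts, top_k = 4, 2

total_params = dense_backbone + num_experts * ffn_per_expert   # stored on disk
active_params = dense_backbone + top_k * ffn_per_expert        # used per token

print(f"total:  {total_params / 1e9:.1f}B")   # total:  3.0B
print(f"active: {active_params / 1e9:.1f}B")  # active: 2.0B
```

With this kind of split, a model can store several experts' worth of capacity while keeping per-token cost close to a much smaller dense model.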

Model Capabilities

Visual Question Answering
Image Understanding
Multimodal Reasoning
Object Recognition
Image Caption Generation
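
These capabilities are exercised through image-plus-prompt inference. Below is a hedged sketch of single-image question answering: MoE-LLaVA ships its own moellava package (github.com/PKU-YuanGroup/MoE-LLaVA) with a LLaVA-style interface, and the imports and call signatures here are assumptions modeled on that style, so verify them against the repository README before relying on this.

```python
# Hedged inference sketch; all moellava imports and signatures below are
# ASSUMPTIONS modeled on the LLaVA-style API -- check the repo README.
import torch
from PIL import Image

from moellava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX  # assumed
from moellava.mm_utils import tokenizer_image_token                    # assumed
from moellava.model.builder import load_pretrained_model               # assumed

model_path = "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e"
# The loader is assumed to return a processor dict keyed by modality.
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, "MoE-LLaVA-Qwen-1.8B-4e"
)
image_processor = processor["image"]

image = Image.open("example.jpg").convert("RGB")
image_tensor = image_processor(image, return_tensors="pt")["pixel_values"].half().cuda()

# Real usage should wrap the question in the repo's conversation template;
# a bare prompt is used here to keep the sketch short.
prompt = DEFAULT_IMAGE_TOKEN + "\nWhat objects are in this image?"
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .cuda()
)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```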

Use Cases

Intelligent Assistant
Image Content Q&A: answers user questions about image content. Surpasses LLaVA-1.5-13B on object hallucination benchmarks.
Content Understanding
Complex Scene Understanding: interprets complex scenes containing multiple objects. Performs comparably to LLaVA-1.5-7B on multiple visual understanding datasets.