
Llama3 MoVA 8B

Developed by zongzhuofan
MoVA-8B is an open-source multimodal large language model that uses a coarse-to-fine mechanism to adaptively route and fuse task-relevant visual expert modules. It is intended for research on multimodal models and chatbots.
Downloads: 835
Release Date: 2024-06-28

Model Overview

MoVA-8B is a multimodal large language model that combines multiple visual encoders and a powerful base language model, supporting tasks such as multimodal fusion and visual question answering.

Model Features

Multimodal Fusion
Uses a coarse-to-fine mechanism to adaptively route and fuse the visual expert modules best suited to each task.
Rich Visual Encoders
Integrates multiple visual encoders such as OpenAI-CLIP-336px and DINOv2-giant.
Powerful Base Large Language Model
Built on meta-llama/Meta-Llama-3-8B-Instruct, which provides strong language understanding and generation capabilities.
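The coarse-to-fine routing idea above can be illustrated with a minimal toy sketch: a coarse stage selects the top-k visual experts by router score, and a fine stage fuses the selected experts' features with softmax weights. This is an illustrative simplification under assumed names (`coarse_route`, `fine_fuse`), not MoVA's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_route(router_scores, k):
    """Coarse stage: pick the indices of the top-k scoring experts."""
    return sorted(range(len(router_scores)),
                  key=lambda i: -router_scores[i])[:k]

def fine_fuse(expert_features, router_scores, selected):
    """Fine stage: softmax-weighted sum of the selected experts' features."""
    weights = softmax([router_scores[i] for i in selected])
    dim = len(expert_features[0])
    fused = [0.0] * dim
    for w, i in zip(weights, selected):
        for d in range(dim):
            fused[d] += w * expert_features[i][d]
    return fused

# Example: 4 experts with 3-dim features; route to the top 2.
scores = [0.1, 2.0, -1.0, 1.5]
feats = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 1.0, 1.0]]
selected = coarse_route(scores, 2)        # experts 1 and 3
fused = fine_fuse(feats, scores, selected)
```

In the real model the router scores would be produced from the task/instruction context and the "features" would be encoder outputs (e.g., CLIP or DINOv2 tokens); the sketch only shows the select-then-fuse control flow.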

Model Capabilities

Multimodal Fusion
Visual Question Answering
Text Generation
Image Analysis
Visual Localization

Use Cases

Multimodal Research
Multimodal Chatbot
Used to build chatbots that support image and text interaction.
Visual Question Answering
Document Understanding
Used to parse and understand document content, supporting tasks such as DocVQA, on which MoVA-8B reaches 83.4% accuracy.