llama3-mova-8b Open-Source Multimodal Large Language Model - Empowering Multimodal Research and Chatbot Development

Llama3 Mova 8b

Developed by zongzhuofan

MoVA-8B is an open-source multimodal large language model that uses a coarse-to-fine mechanism to adaptively route and fuse visual expert modules for specific tasks. It can be used for research on multimodal models and chatbots.

Multimodal Fusion

Transformers

#Multimodal Adaptive Routing #Visual Expert Fusion #Cross-Domain Visual Question Answering

Downloads 835

Release Time: 6/28/2024

Model Overview

MoVA-8B is a multimodal large language model that combines multiple visual encoders and a powerful base language model, supporting tasks such as multimodal fusion and visual question answering.

Model Features

Multimodal Fusion

Uses a coarse-to-fine mechanism to adaptively route and fuse visual expert modules for specific tasks.

Rich Visual Encoders

Integrates multiple visual encoders such as OpenAI-CLIP-336px and DINOv2-giant.

Powerful Base Large Language Model

Based on meta-llama/Meta-Llama-3-8B-Instruct, with strong language understanding and generation capabilities.

Model Capabilities

Multimodal Fusion

Visual Question Answering

Text Generation

Image Analysis

Visual Localization

Use Cases

Multimodal Research

Multimodal Chatbot

Used to build chatbots that support image and text interaction.

Visual Question Answering

Document Understanding

Used to parse and understand document content, supporting tasks such as DocVQA.

DocVQA accuracy of 83.4%

🚀 MoVA-8B Model Card

MoVA-8B is an open-source multimodal large language model. It adaptively routes and fuses task-specific vision experts, which makes it a useful tool for multimodal research.

🚀 Quick Start

You can use this model directly with the code provided in our [repository].

✨ Features

MoVA-8B is an open-source multimodal large language model (MLLM), adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism.

Vision Encoders

OpenAI-CLIP-336px, DINOv2-giant, Co-DETR-large, SAM-huge, Vary-base, Pix2Struct-large, Deplot-base, and BiomedCLIP-base.
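The coarse-to-fine idea described above can be illustrated with a toy sketch. This is not MoVA's implementation: the top-k selection, the softmax gate, the relevance scores, the feature vectors, and the function names `coarse_route` and `fine_fuse` are all hypothetical stand-ins chosen for illustration. Only the expert names come from this model card. Refer to the paper and code for the actual routing and fusion modules.

```python
import math

# Expert names are taken from the model card; everything else is hypothetical.
EXPERTS = [
    "OpenAI-CLIP-336px", "DINOv2-giant", "Co-DETR-large", "SAM-huge",
    "Vary-base", "Pix2Struct-large", "Deplot-base", "BiomedCLIP-base",
]


def coarse_route(relevance, top_k=2):
    """Coarse stage (assumed): keep the top_k experts with the highest relevance."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:top_k]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fine_fuse(base_feature, expert_features, gate_logits):
    """Fine stage (assumed): add a softmax-weighted sum of expert features
    to a base feature vector."""
    weights = softmax(gate_logits)
    fused = list(base_feature)
    for weight, feature in zip(weights, expert_features):
        for i, value in enumerate(feature):
            fused[i] += weight * value
    return fused


# Toy input: made-up relevance scores that favour two document-oriented experts.
relevance = {name: 0.0 for name in EXPERTS}
relevance["Pix2Struct-large"] = 2.0
relevance["Vary-base"] = 1.5

selected = coarse_route(relevance, top_k=2)

base = [1.0, 1.0]  # hypothetical base visual feature
expert_feats = {"Pix2Struct-large": [2.0, 0.0], "Vary-base": [0.0, 2.0]}
fused = fine_fuse(base, [expert_feats[name] for name in selected], [0.0, 0.0])

print(selected)
print(fused)
```

The split into two functions mirrors the two stages: selection first narrows the set of experts, so the fusion step only has to weigh the features that were judged relevant to the task.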

Base LLM

meta-llama/Meta-Llama-3-8B-Instruct

Paper or resources for more information

[Paper] [Code]

📚 Documentation

Intended use

Primary intended uses

The primary use of MoVA-8B is research on multimodal models and chatbots.

Primary intended users

The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

15M diverse visual instruction tuning samples for pre-training, including DataComp-1B, ShareGPT4V-PT, Objects365, and MMC-Instruction. Please refer to our paper for more details.
2M high-quality instruction samples for fine-tuning. We integrate several visual question answering datasets across various domains, such as DocVQA, ChartQA, InfographicVQA, AI2D, ST-VQA, TextVQA, SynthDoG-en, Geometry3K, PGPS9K, Geo170K, VQA-RAD, and SLAKE, into LLaVA-mix-665k. We also include equivalent comprehensive captions generated by GPT4-V.

Evaluation dataset

We evaluate our model on a wide range of popular MLLM benchmarks.

MultiModal Benchmark

Name	LLM	#Tokens	MME	MMBench	MMBench-CN	QBench	MathVista	MathVerse	POPE
MoVA-8B	Llama3-8B	576	1595.8 / 347.5	75.3	67.7	70.8	37.7	21.4	89.3

General & Text-oriented VQA

Name	LLM	#Tokens	VQAv2	GQA	SQA	TextVQA	ChartQA	DocVQA	AI2D
MoVA-8B	Llama3-8B	576	83.5	65.2	74.7	77.1	70.5	83.4	77.0

Visual Grounding

Name	LLM	#Tokens	RefCOCO (val)	RefCOCO (testA)	RefCOCO (testB)	RefCOCO+ (val)	RefCOCO+ (testA)	RefCOCO+ (testB)	RefCOCO‑g (val)	RefCOCO‑g (test)
MoVA-8B	Llama3-8B	576	92.18	94.75	88.24	88.45	92.21	82.82	90.05	90.23

📄 License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. META LLAMA 3 COMMUNITY LICENSE AGREEMENT).
