llava-mini-llama-3.1-8b: Open-Source Multimodal Model for Efficient Image and Video Understanding

Llava Mini Llama 3.1 8b

Developed by ICTNLP

LLaVA-Mini is an efficient multimodal large model that significantly improves the efficiency of image and video understanding by using only 1 visual token to represent an image.

Image-to-Text

Safetensors

Open Source License: GPL-3.0

Tags: #Single Visual Token #Efficient Multimodal #Video Understanding

Downloads 12.45k

Release Time: January 7, 2025

Model Overview

LLaVA-Mini is a unified multimodal large model that efficiently supports the understanding of images, high-resolution images, and videos. Guided by research on interpretability within multimodal models, LLaVA-Mini significantly enhances efficiency while maintaining visual capabilities.

Model Features

Single Visual Token Efficient Representation

Only 1 token is needed to represent each image, significantly improving processing efficiency.

Efficient Computation

Reduces floating-point operations (FLOPs) by 77% and cuts response latency from 100 ms to 40 ms.

Low GPU Memory Usage

Reduces GPU memory usage from 360 MB per image to 0.6 MB per image, which makes processing 3-hour videos feasible.

Unified Multimodal Processing

Unified support for understanding images, high-resolution images, and videos.

Model Capabilities

Image Understanding

Video Understanding

High-Resolution Image Processing

Multimodal Reasoning

Text Generation

Use Cases

Visual Content Analysis

Image Content Description

Analyze image content and generate descriptive text

Accurately identifies objects and scenes in images.

Video Content Understanding

Understand video content and generate summaries

Can describe the main events occurring in the video.

Interactive Applications

Visual Question Answering System

Answer user questions about image or video content

Provides accurate and contextually relevant answers.

🚀 LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos. Guided by an interpretability analysis of how LMMs process visual tokens, it significantly improves efficiency while preserving vision capabilities. The code, model, and demo of LLaVA-Mini are now available!

🚀 Quick Start

Requirements

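Install the packages from the root of a local checkout of the LLaVA-Mini repository (pip install -e . installs from the current directory):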

conda create -n llavamini python=3.10 -y
conda activate llavamini
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
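After installation, a quick import check can confirm that the environment is usable. The sketch below is illustrative only. It assumes the editable install exposes a package named llavamini (inferred from the script paths used in this document), that torch is a dependency, and that flash-attn imports as flash_attn. The helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names: "llavamini" (from the script paths in this document),
    # "torch" (assumed dependency), "flash_attn" (import name of flash-attn).
    required = ["llavamini", "torch", "flash_attn"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If a module is reported as missing, re-run the corresponding pip command inside the activated llavamini conda environment.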

Command Interaction

Image understanding, using --image-file:

# Image Understanding
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path  ICTNLP/llava-mini-llama-3.1-8b \
    --image-file llavamini/serve/examples/baby_cake.png \
    --conv-mode llava_llama_3_1 --model-name "llava-mini" \
    --query "What's the text on the cake?"

Video understanding, using --video-file:

# Video Understanding
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path  ICTNLP/llava-mini-llama-3.1-8b \
    --video-file llavamini/serve/examples/fifa.mp4 \
    --conv-mode llava_llama_3_1 --model-name "llava-mini" \
    --query "What happened in this video?"
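The two commands above differ only in the media flag, so a small wrapper can pick the flag from the file extension. This is a minimal sketch, not part of the LLaVA-Mini code base: build_command and VIDEO_SUFFIXES are hypothetical names, and the list of video extensions is an assumption. Only flags that appear in the commands above are used.

```python
import shlex
from pathlib import Path

# Assumed set of video extensions; extend it to match your data.
VIDEO_SUFFIXES = {".mp4", ".avi", ".mov", ".mkv"}


def build_command(media_path, query, model_path="ICTNLP/llava-mini-llama-3.1-8b"):
    """Build the argv list for run_llava_mini.py using the flags shown above."""
    suffix = Path(media_path).suffix.lower()
    media_flag = "--video-file" if suffix in VIDEO_SUFFIXES else "--image-file"
    return [
        "python", "llavamini/eval/run_llava_mini.py",
        "--model-path", model_path,
        media_flag, media_path,
        "--conv-mode", "llava_llama_3_1",
        "--model-name", "llava-mini",
        "--query", query,
    ]


if __name__ == "__main__":
    cmd = build_command("llavamini/serve/examples/baby_cake.png",
                        "What's the text on the cake?")
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems with queries that contain apostrophes or spaces.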

Reproduction and Evaluation

Refer to Evaluation.md for the evaluation of LLaVA-Mini on image/video benchmarks.

Cases

LLaVA-Mini achieves high-quality image understanding and video understanding.

[Figures: example cases 1 to 4]

LLaVA-Mini dynamically compresses each image to capture important visual information (in the visualization, brighter areas are weighted more heavily during compression).

[Figure: compression visualization]
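To give an intuition for what "weighted more heavily during compression" can mean, the toy example below pools several token vectors into one using softmax attention weights against a single query. This is an assumption-laden illustration, not the authors' implementation: the function name, the single-query design, and the toy vectors are all hypothetical, and the actual compression module is described in the paper.

```python
import math


def compress_to_one_token(tokens, query):
    """Pool a list of token vectors into one vector using softmax attention weights."""
    dim = len(query)
    # Scaled dot-product score between the query and each token.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the tokens is the single compressed token.
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]  # toy "visual tokens"
    query = [1.0, 1.0]                              # toy query vector
    pooled, weights = compress_to_one_token(tokens, query)
    # The third token aligns best with the query, so it gets the largest weight.
    print([round(w, 3) for w in weights])
```

In this picture, the per-token weights play the role of the brightness map in the figure: tokens with larger weights contribute more to the compressed representation.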

✨ Features

Good Performance: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (compression rate of 0.17%).
High Efficiency: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and can process over 10,000 video frames on a GPU with 24 GB of memory.
Insights: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conduct a preliminary analysis to explore how large multimodal models (LMMs) process visual tokens. Please refer to our paper for a detailed analysis and our conclusions.
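The compression rate quoted above follows directly from the token counts. The short sketch below reproduces the arithmetic and extends it to a video, under the assumption (ours, for illustration) that every frame is encoded like a single image with the same per-image token count.

```python
TOKENS_PER_IMAGE_BASELINE = 576  # LLaVA-v1.5, per the text above
TOKENS_PER_IMAGE_MINI = 1        # LLaVA-Mini, per the text above


def compression_rate_percent(compressed, original):
    """Compressed token count as a percentage of the original count."""
    return 100.0 * compressed / original


def visual_tokens(num_frames, tokens_per_frame):
    """Total visual tokens for a video, assuming one image-sized encoding per frame."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    rate = compression_rate_percent(TOKENS_PER_IMAGE_MINI, TOKENS_PER_IMAGE_BASELINE)
    print(f"compression rate: {rate:.2f}%")  # prints "compression rate: 0.17%"
    frames = 10_000
    # Prints "5760000 10000": baseline tokens versus one-token-per-frame tokens.
    print(visual_tokens(frames, TOKENS_PER_IMAGE_BASELINE),
          visual_tokens(frames, TOKENS_PER_IMAGE_MINI))
```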

⚠️ Important Note

LLaVA-Mini requires only 1 token to represent each image, which improves the efficiency of image and video understanding in three ways:

Computational effort: 77% reduction in FLOPs

Response latency: reduced from 100 milliseconds to 40 milliseconds

VRAM usage: reduced from 360 MB/image to 0.6 MB/image, supporting 3-hour video processing
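A back-of-the-envelope estimate shows why the per-image memory figure matters for long videos. The sketch below assumes sampling at 1 frame per second (an assumption for illustration, not stated above) and counts only the per-image memory quoted above, ignoring model weights and other overhead.

```python
def video_memory_gb(hours, mb_per_frame, fps=1.0):
    """Estimate per-frame memory in GB (1 GB = 1000 MB) for a video of the given length."""
    frames = hours * 3600 * fps
    return frames * mb_per_frame / 1000.0


if __name__ == "__main__":
    # A 3-hour video at the assumed 1 fps has 10,800 frames.
    print(f"0.6 MB/image: {video_memory_gb(3, 0.6):.2f} GB")
    print(f"360 MB/image: {video_memory_gb(3, 360):.0f} GB")
```

Under these assumptions, the 0.6 MB/image figure gives roughly 6.5 GB for a 3-hour video, which is consistent with processing over 10,000 frames on a 24 GB GPU, while 360 MB/image would require several terabytes.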

[Figure: performance]

💻 Usage Examples

Basic Usage

Download the LLaVA-Mini model (the commands below use the model path ICTNLP/llava-mini-llama-3.1-8b).

Run these scripts and interact with LLaVA-Mini in your browser:

# Launch a controller
python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &

# Build the API of LLaVA-Mini
CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker \
    --host 0.0.0.0 --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &

# Start the interactive interface
python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
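Because the controller and the model worker are started in the background, it can help to confirm that each service is accepting connections before opening the browser. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the port numbers are taken from the commands above.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, retries=30, delay=2.0):
    """Poll until the port accepts connections or the retries run out."""
    for _ in range(retries):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # Ports from the commands above: controller, model worker, Gradio web UI.
    services = [("controller", 10000), ("model worker", 40000), ("web UI", 7860)]
    for name, port in services:
        up = wait_for_port("localhost", port, retries=1, delay=0)
        print(f"{name} (port {port}): {'up' if up else 'not reachable'}")
```

An open port only shows that the process is listening. The model worker may still be loading weights, so the first request can take longer than later ones.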

[Figure: llava_mini]

📄 License

This project is licensed under the GPL-3.0 license.

🖋 Citation

If you find this repository useful, please cite it as:

@misc{llavamini,
      title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token}, 
      author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng},
      year={2025},
      eprint={2501.03895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03895}, 
}

If you have any questions, please feel free to submit an issue or contact zhangshaolei20z@ict.ac.cn.
