TinyLLaVA: A Framework of Small-scale Large Multimodal Models
TinyLLaVA is a framework for small-scale large multimodal models. It offers high-performance models with fewer parameters, providing an efficient solution for multimodal tasks.
Quick Start
Requirements and Installation
Set up the environment as follows:
- Clone this repository and navigate to the TinyLLaVABench folder:
git clone https://github.com/DLCV-BUAA/TinyLLaVABench.git
cd TinyLLaVABench
- Install the package:
conda create -n tinyllava python=3.10 -y
conda activate tinyllava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
Upgrade to the latest code base
git pull
pip install -e .
# if you see some import errors when you upgrade, please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
Load model
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model
model_path = "bczhou/TinyLLaVA-3.1B"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
⨠Features
High performance, but with fewer parameters
Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.
Installation
The installation steps are included in the "Quick Start" section above.
Usage Examples
Gradio Web Demo
Launch a local web demo by running:
python tinyllava/serve/app.py --model-path bczhou/TinyLLaVA-3.1B --model-name TinyLLaVA-3.1B
CLI Inference
We also support running inference with the CLI. To use our model, run:
python -m tinyllava.serve.cli \
--model-path bczhou/TinyLLaVA-3.1B \
--image-file "./tinyllava/serve/examples/extreme_ironing.jpg"
Run Inference
Here's an example of running inference with TinyLLaVA-3.1B:
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model
model_path = "bczhou/TinyLLaVA-3.1B"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": "phi",
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()
eval_model(args)
Important Note
We use different conv_mode values for different models. Replace the conv_mode in args according to this table:

| Model | conv_mode |
|----------------|-----------|
| TinyLLaVA-3.1B | phi |
| TinyLLaVA-2.0B | phi |
| TinyLLaVA-1.5B | v1 |
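If you script over several checkpoints, keeping this mapping in one place avoids prompt-template mismatches. A minimal sketch (the CONV_MODES dict and pick_conv_mode helper are illustrative, not part of the TinyLLaVA API):

```python
# Illustrative helper (not part of the TinyLLaVA API): map checkpoint names
# to the conv_mode values from the table above.
CONV_MODES = {
    "TinyLLaVA-3.1B": "phi",
    "TinyLLaVA-2.0B": "phi",
    "TinyLLaVA-1.5B": "v1",
}

def pick_conv_mode(model_name: str) -> str:
    """Return the conv_mode for a known TinyLLaVA checkpoint name."""
    if model_name not in CONV_MODES:
        raise ValueError(f"Unknown model {model_name!r}; add it to CONV_MODES")
    return CONV_MODES[model_name]
```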
Documentation
Model Zoo
Legacy Model
Pretrained Models
Model Details
| Name | LLM | Checkpoint | LLaVA-Bench-Wild | MME | MMBench | MM-Vet | SQA-image | VQA-v2 | GQA | TextVQA |
|---|---|---|---|---|---|---|---|---|---|---|
| TinyLLaVA-3.1B | Phi-2 | TinyLLaVA-3.1B | 75.8 | 1464.9 | 66.9 | 32.0 | 69.1 | 79.9 | 62.0 | 59.1 |
| TinyLLaVA-2.0B | StableLM-2-1.6B | TinyLLaVA-2.0B | 66.4 | 1433.8 | 63.3 | 32.6 | 64.7 | 78.9 | 61.9 | 56.4 |
| TinyLLaVA-1.5B | TinyLlama | TinyLLaVA-1.5B | 60.8 | 1276.5 | 55.2 | 25.8 | 60.3 | 76.9 | 60.3 | 51.7 |
Evaluation
To ensure reproducibility, we evaluate the models with greedy decoding. See Evaluation.md.
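Concretely, greedy decoding means sampling is disabled and a single beam is used; these are the same generation fields that appear in the inference example above. A minimal sketch of those settings (the GREEDY_DECODING name is only illustrative):

```python
# Generation settings corresponding to greedy decoding, mirroring the fields
# used in the inference example above (the dict name is illustrative only).
GREEDY_DECODING = {
    "temperature": 0,  # no sampling randomness
    "top_p": None,     # nucleus sampling off
    "num_beams": 1,    # a single beam reduces beam search to greedy search
}
```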
Data Preparation
In our paper, we used two different datasets: the LLaVA dataset and the ShareGPT4V dataset, and compared their differences.
Pretraining Images
- LLaVA: The pretraining images of LLaVA come from the 558K subset of the LAION-CC-SBU dataset.
- ShareGPT4V: The pretraining images of ShareGPT4V are a mixture of the 558K LAION-CC-SBU subset, the SAM dataset, and the COCO dataset.
Pretraining Annotations
- LLaVA: The pretraining annotations of LLaVA are here.
- ShareGPT4V: The pretraining annotations of ShareGPT4V are here.
SFT Images & Annotations
The two SFT datasets are largely the same, except that the 23K detailed-description data in LLaVA-1.5-SFT is replaced with detailed captions randomly sampled from the 100K ShareGPT4V data.
Download data
- Download relevant images:
- LAION-CC-SBU-558K: images.zip
- COCO: This dataset is from the COCO2017 challenge. Download: train2017
- WebData: This dataset is curated by the ShareGPT4V project. Download: [images](https://drive.google.com/drive/folders/1tCUQ-sq6vdshZVkF0ZeF3K4eztkXJgax?usp=sharing). Only for academic usage.
- SAM: This dataset is collected by Meta. Download: images. We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from here.
- GQA: GQA project page. Download: images
- OCR-VQA: OCR-VQA project page. Download: download script. We save all files as `.jpg` (see the conversion sketch after this list).
- TextVQA: TextVQA project page. Download: train_val_images
- VisualGenome: VisualGenome project page. Download: part1, part2
- Download relevant annotations:
- LLaVA's pretraining annotations: blip_laion_cc_sbu_558k.json
- LLaVA's SFT annotations: llava_v1_5_mix665k.json
- ShareGPT4V's pretraining annotations: share-captioner_coco_lcs_sam_1246k_1107.json
- ShareGPT4V's SFT annotations: sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
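As noted in the OCR-VQA item above, all of its files are saved as `.jpg`. The download script fetches images in mixed formats, so a one-off conversion along these lines can be run over the download directory (a hedged sketch using Pillow; the data/ocr_vqa/images path follows the layout in the next section):

```python
# One-off conversion of OCR-VQA images to .jpg, as assumed by the annotations.
# Hedged sketch: point SRC_DIR at wherever the download script saved the images.
from pathlib import Path
from PIL import Image

SRC_DIR = Path("data/ocr_vqa/images")  # assumed location; see the layout below

for path in sorted(SRC_DIR.iterdir()):
    if not path.is_file() or path.suffix.lower() in {".jpg", ".jpeg"}:
        continue
    img = Image.open(path).convert("RGB")       # drop alpha/palette so JPEG can encode it
    img.save(path.with_suffix(".jpg"), "JPEG")  # write the .jpg alongside the original
    path.unlink()                               # remove the original file
```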
Organize Data
Organize the image files and annotation files as follows in `path/to/your/data`:
data
├── llava
│   └── llava_pretrain
│       ├── images
│       └── blip_laion_cc_sbu_558k.json
├── coco
│   └── train2017
├── sam
│   └── images
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa
│   └── images
├── web-celebrity
│   └── images
├── web-landmark
│   └── images
├── wikiart
│   └── images
└── text_files
    ├── llava_v1_5_mix665k.json
    ├── share-captioner_coco_lcs_sam_1246k_1107.json
    └── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
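A quick way to catch missing downloads before launching training is to check the layout above programmatically. A minimal sketch (the path list is simply a transcription of the tree; replace DATA_ROOT with your own data path):

```python
# Sanity-check the data layout above before training.
# EXPECTED is a transcription of the directory tree; adjust DATA_ROOT.
from pathlib import Path

DATA_ROOT = Path("path/to/your/data")  # replace with your own path

EXPECTED = [
    "llava/llava_pretrain/images",
    "llava/llava_pretrain/blip_laion_cc_sbu_558k.json",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files/llava_v1_5_mix665k.json",
    "text_files/share-captioner_coco_lcs_sam_1246k_1107.json",
    "text_files/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]

missing = [p for p in EXPECTED if not (DATA_ROOT / p).exists()]
print("All expected paths found." if not missing else f"Missing paths: {missing}")
```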
Train
This section describes the base recipe.
Hyperparameters
The hyperparameters used in pretraining and finetuning are provided below.
- Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------|------------------:|--------------:|-------:|-----------:|-------------:|
| TinyLLaVA-3.1B | 256 | 1e-3 | 1 | 3072 | 0 |

- Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------|------------------:|--------------:|-------:|-----------:|-------------:|
| TinyLLaVA-3.1B | 128 | 2e-5 | 1 | 3072 | 0 |
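The global batch size in these tables is the product of the per-device batch size, the number of GPUs, and the gradient-accumulation steps, so the per-device settings in the training scripts must be chosen to match. A small sketch of that arithmetic (the per-device numbers below are placeholders, not values from the released scripts):

```python
# Global batch size = per-device batch size x number of GPUs x grad-accumulation steps.
# The per-device numbers below are placeholders; choose them so the product
# matches the tables above (256 for pretraining, 128 for finetuning).
def global_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    return per_device * num_gpus * grad_accum

assert global_batch_size(per_device=32, num_gpus=8, grad_accum=1) == 256  # pretraining
assert global_batch_size(per_device=16, num_gpus=8, grad_accum=1) == 128  # finetuning
```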
Pretrain
Replace the paths with your own. Training script with DeepSpeed ZeRO-2: pretrain.sh.
Finetune
Replace the paths with your own. Training script with DeepSpeed ZeRO-3: finetune.sh.
Custom Finetune
Check out our guide to custom finetuning with LoRA here.
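The project's own LoRA recipe lives in the script linked above. As a rough, standalone illustration of what LoRA-wrapping a causal-LM backbone looks like with the peft library (not the project's exact code; the model name and target_modules are assumptions that depend on your backbone):

```python
# Rough illustration of LoRA with peft -- not the project's finetuning script.
# The model name and target_modules are assumptions; adapt them to your backbone.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")  # assumed LLM backbone
lora_cfg = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adjust to your model's projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```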
Technical Details
Models are evaluated with greedy decoding to ensure reproducibility. Training compares two data mixtures, LLaVA and ShareGPT4V, and the hyperparameters for both pretraining and finetuning are listed in the Train section above.
License
This project is licensed under the Apache 2.0 license.
News
- [2024.03.10] Base recipe out!
- [2024.03.10] Finetune scripts out!
- [2024.02.25] Update evaluation scripts and docs!
- [2024.02.25] Data descriptions out. Release TinyLLaVA-1.5B and TinyLLaVA-2.0B!
- [2024.02.24] Example code on inference and model loading added!
- [2024.02.23] Evaluation code and scripts released!
- [2024.02.21] Creating the TinyLLaVABench repository on GitHub!
- [2024.02.21] Our paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models is out!
- [2024.01.11] Our first model TinyLLaVA-1.4B is out!
TODO
- [ ] Add support for Ollama and llama.cpp.
- [x] Developers' guide / How to build demo locally.
- [x] Training and custom finetuning docs.
- [x] Model Zoo descriptions.
- [x] Examples and inference.
- [x] Release code for training.
- [x] Add descriptions for evaluation.
- [x] Add descriptions for data preparation.
- [x] Release TinyLLaVA-1.5B and TinyLLaVA-2.0B.
- [x] Release TinyLLaVA-3.1B.
- [x] Release the evaluation code and weights today (2024.2.23).
Citation
If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.
@misc{zhou2024tinyllava,
      title={TinyLLaVA: A Framework of Small-scale Large Multimodal Models},
      author={Baichuan Zhou and Ying Hu and Xi Weng and Junlong Jia and Jie Luo and Xien Liu and Ji Wu and Lei Huang},
      year={2024},
      eprint={2402.14289},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Community Efforts
- Our codebase is built upon the LLaVA project. Great work!
- Our project uses data from the ShareGPT4V project.