Tiny-LLaVA-v1-hf Open-Source Multimodal Model - Free Deployment to Aid Vision-Language Task Processing

Tiny Llava V1 Hf

Developed by bczhou

TinyLLaVA is a framework of small-scale large multimodal models for vision-language tasks. Its models have few parameters yet perform strongly.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Lightweight Multimodal #Visual Language Understanding #Efficient Small Model

Downloads 2,372

Release Time: 1/11/2024

Model Overview

TinyLLaVA is an efficient multimodal model capable of handling image-to-text generation tasks, supporting both English and Chinese, with outstanding performance across multiple benchmarks.

Model Features

High-Performance Small-Scale Model

The 3.1B-parameter TinyLLaVA achieves better overall performance than 7B-parameter models such as LLaVA-1.5 and Qwen-VL.

Multimodal Capabilities

Supports image understanding and text generation, capable of handling complex vision-language tasks

Efficient Inference

Small parameter size enables faster inference speed and lower resource consumption

Model Capabilities

Image understanding

Visual question answering

Image caption generation

Multimodal reasoning

Use Cases

Visual Question Answering

Image content Q&A

Answer various questions about image content

Achieves 79.9% accuracy on VQA-v2 dataset

Image Captioning

Automatic image annotation

Generate detailed descriptive text for images

Scores 75.8 on LLaVA-Bench-Wild

🚀 TinyLLaVA: A Framework of Small-scale Large Multimodal Models

TinyLLaVA is a framework for small-scale large multimodal models. It offers high-performance models with fewer parameters, achieving better overall performance than some existing 7B models.

## 🚀 Quick Start
### Prerequisites
We recommend the following setup.

Clone this repository and navigate to the TinyLLaVABench folder:
```shell
git clone https://github.com/DLCV-BUAA/TinyLLaVABench.git
cd TinyLLaVABench
```

Install the package:
```shell
conda create -n tinyllava python=3.10 -y
conda activate tinyllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

Install additional packages for training cases:
```shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

### Upgrade to the latest code base
```shell
git pull
pip install -e .

# if you see some import errors when you upgrade, please try running the command below (without #)
# pip install flash-attn --no-build-isolation --no-cache-dir
```

### Load model
```python
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"

# Load the tokenizer, model weights, and image processor for the checkpoint.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)
```
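After the installation steps in this Quick Start, a quick sanity check can confirm that the environment matches what the commands set up. The sketch below is not part of the TinyLLaVA code base: the helper name `check_environment` is hypothetical, and treating Python 3.10 (the version in the conda command) as a minimum is an assumption. It uses only the standard library and reports whether the `tinyllava` and `flash_attn` packages can be found.

```python
import importlib.util
import sys


def check_environment(packages=("tinyllava", "flash_attn"), min_python=(3, 10)):
    """Return a dict describing whether the environment looks ready.

    `packages` are top-level import names; "flash_attn" is only needed
    for the training extras. `min_python` is an assumed lower bound.
    """
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

A missing `flash_attn` entry only matters if you plan to run the training scripts.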

## ✨ Features
### ⚡ High performance, but with fewer parameters
Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.

## 📦 Installation
The installation steps are included in the Quick Start section above.

## 💻 Usage Examples
### Basic Usage
#### Gradio Web Demo
Launch a local web demo by running:
```shell
python tinyllava/serve/app.py --model-path bczhou/TinyLLaVA-3.1B --model-name TinyLLaVA-3.1B
```

#### CLI Inference
We also support running inference with CLI. To use our model, run:
```shell
python -m tinyllava.serve.cli \
    --model-path bczhou/TinyLLaVA-3.1B \
    --image-file "./tinyllava/serve/examples/extreme_ironing.jpg"
```
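
If you want to start the CLI from a Python script, for example to try several images in turn, the command can be assembled programmatically. This is a minimal sketch, not something shipped with the repository: `build_cli_command` and `run_cli` are hypothetical helper names, and only the two flags shown in the command above are used.

```python
import subprocess
import sys


def build_cli_command(model_path, image_file):
    """Build the argv list for the CLI call shown above."""
    return [
        sys.executable, "-m", "tinyllava.serve.cli",
        "--model-path", model_path,
        "--image-file", image_file,
    ]


def run_cli(model_path, image_file):
    """Launch the CLI as a child process and raise if it exits with an error."""
    return subprocess.run(build_cli_command(model_path, image_file), check=True)
```

Passing the arguments as a list avoids shell quoting problems when an image path contains spaces.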

### Advanced Usage
#### Run Inference
Here's an example of running inference with TinyLLaVA-3.1B:
```python
from tinyllava.model.builder import load_pretrained_model
from tinyllava.mm_utils import get_model_name_from_path
from tinyllava.eval.run_tiny_llava import eval_model

model_path = "bczhou/TinyLLaVA-3.1B"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

# Bundle the settings into a simple attribute object and pass it to eval_model.
args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": "phi",
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
```

### Important
We use different `conv_mode` for different models. Replace the `conv_mode` in `args` according to this table:
| model           | conv_mode  |
|-----------------|------------|
| TinyLLaVA-3.1B  | phi        |
| TinyLLaVA-2.0B  | phi        |
| TinyLLaVA-1.5B  | v1         |
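
To avoid picking the wrong `conv_mode` by hand, the table can be turned into a small lookup. This is a sketch only: `CONV_MODES` and `conv_mode_for` are hypothetical names, and the mapping simply restates the three rows above.

```python
# Mapping copied from the conv_mode table above.
CONV_MODES = {
    "TinyLLaVA-3.1B": "phi",
    "TinyLLaVA-2.0B": "phi",
    "TinyLLaVA-1.5B": "v1",
}


def conv_mode_for(model_path):
    """Return the conv_mode for a path or repo id such as 'bczhou/TinyLLaVA-3.1B'."""
    name = model_path.rstrip("/").split("/")[-1]
    try:
        return CONV_MODES[name]
    except KeyError:
        raise ValueError(f"No conv_mode known for {name!r}") from None


print(conv_mode_for("bczhou/TinyLLaVA-1.5B"))  # v1
```

Raising on an unknown model name is a deliberate choice here, so that a new checkpoint does not silently run with a mismatched conversation template.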

## 📚 Documentation
### Model Zoo
#### Legacy Model
- [tiny-llava-hf](https://huggingface.co/bczhou/tiny-llava-v1-hf)

#### Pretrained Models
- [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B)
- [TinyLLaVA-2.0B](https://huggingface.co/bczhou/TinyLLaVA-2.0B)
- [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B)

#### Model Details
| Name           | LLM             | Checkpoint                                                     | LLaVA-Bench-Wild | MME    | MMBench | MM-Vet | SQA-image | VQA-v2 | GQA  | TextVQA |
|----------------|-----------------|----------------------------------------------------------------|------------------|--------|---------|--------|-----------|--------|------|---------|
| TinyLLaVA-3.1B | Phi-2           | [TinyLLaVA-3.1B](https://huggingface.co/bczhou/TinyLLaVA-3.1B) | 75.8             | 1464.9 | 66.9    | 32.0   | 69.1      | 79.9   | 62.0 | 59.1    |
| TinyLLaVA-2.0B | StableLM-2-1.6B | [TinyLLaVA-2.0B](https://huggingface.co/bczhou/TinyLLaVA-2.0B) | 66.4             | 1433.8 | 63.3    | 32.6   | 64.7      | 78.9   | 61.9 | 56.4    |
| TinyLLaVA-1.5B | TinyLlama       | [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B) | 60.8             | 1276.5 | 55.2    | 25.8   | 60.3      | 76.9   | 60.3 | 51.7    |


### Evaluation
To ensure reproducibility, we evaluate the models with greedy decoding.
See [Evaluation.md](https://github.com/DLCV-BUAA/TinyLLaVABench/blob/main/docs/Evaluation.md) for details.
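
Greedy decoding picks the highest-scoring token at every step, with no sampling, so the same input always yields the same output. The toy sketch below illustrates only that idea: `next_token_scores` and the score table are hypothetical stand-ins for a model, not TinyLLaVA's decoding code.

```python
def greedy_decode(next_token_scores, start, eos, max_new_tokens=16):
    """Repeatedly append the highest-scoring next token until `eos`.

    `next_token_scores(tokens)` is a hypothetical stand-in for a model: it
    maps the tokens so far to a dict of {candidate_token: score}.
    """
    tokens = [start]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        # argmax over candidates: no randomness, so the result is deterministic.
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Toy "model": scores depend only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "cat": 0.3, "</s>": 0.1},
    "a": {"cat": 0.7, "a": 0.2, "</s>": 0.1},
    "cat": {"</s>": 0.9, "a": 0.1},
}

print(greedy_decode(lambda toks: TABLE[toks[-1]], "<s>", "</s>"))
# ['<s>', 'a', 'cat', '</s>']
```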

### Data Preparation
In our paper, we used two different datasets: the [LLaVA dataset](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#pretrain-feature-alignment) and the [ShareGPT4V dataset](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md), and compared their differences. In this section, we provide information on data preparation.

#### Pretraining Images
- LLaVA: The pretraining images of LLaVA are from the 558K subset of the LAION-CC-SBU dataset.
- ShareGPT4V: The pretraining images of ShareGPT4V are a mixture of the 558K LAION-CC-SBU subset, the SAM dataset, and the COCO dataset.

#### Pretraining Annotations
- LLaVA: The pretraining annotations of LLaVA are [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
- ShareGPT4V: The pretraining annotations of ShareGPT4V are [here](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json).

#### SFT Images & Annotations
The two SFT datasets are mostly the same, except that the 23K detailed description data in LLaVA-1.5-SFT is replaced with detailed captions randomly sampled from the [100K ShareGPT4V data](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_instruct_gpt4-vision_cap100k.json).
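
The replacement described above can be pictured as a filter followed by a random sample. The sketch below is an illustration under stated assumptions, not the authors' data pipeline: the function name, the `is_detailed` predicate, and the record layout are hypothetical, and it assumes the number of sampled captions equals the number of records removed.

```python
import random


def swap_detailed_descriptions(sft_records, caption_pool, is_detailed, seed=0):
    """Replace records flagged by `is_detailed` with randomly sampled captions.

    Assumption: as many captions are sampled as records are removed, so the
    dataset size stays the same.
    """
    kept = [r for r in sft_records if not is_detailed(r)]
    n_removed = len(sft_records) - len(kept)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return kept + rng.sample(caption_pool, n_removed)


# Tiny made-up records; real annotation files use their own schema.
records = [{"id": i, "kind": "detail" if i < 3 else "other"} for i in range(10)]
pool = [{"id": f"cap{i}", "kind": "caption"} for i in range(100)]
mixed = swap_detailed_descriptions(records, pool, lambda r: r["kind"] == "detail")
print(len(mixed))  # 10
```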

#### Download data
1. Download relevant images
- LAION-CC-SBU-558K: [images.zip](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip)
- COCO: This dataset is from the [COCO2017 challenge](https://cocodataset.org/). Download: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- WebData: This dataset is curated by the [ShareGPT4V project](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V). Download: [images](https://drive.google.com/drive/folders/1tCUQ-sq6vdshZVkF0ZeF3K4eztkXJgax?usp=sharing). Only for academic usage.
- SAM: This dataset is collected by [Meta](https://ai.meta.com/datasets/segment-anything-downloads/). Download: [images](https://ai.meta.com/datasets/segment-anything-downloads/). We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from [here](https://drive.google.com/file/d/1dKumdOKSXtV7lIXdrG7jsIK_z2vZv2gs/view?usp=drive_link).
- GQA: [GQA project page](https://cs.stanford.edu/people/dorarad/gqa/about.html). Download: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [OCR-VQA project page](https://ocr-vqa.github.io/). Download: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing). We save all files as `.jpg`.
- TextVQA: [TextVQA project page](https://textvqa.org/). Download: [trainvalimages](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [VisualGenome project page](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html). Download: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

2. Download relevant annotations
- LLaVA's pretraining annotations: [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
- LLaVA's SFT annotations: [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json)
- ShareGPT4V's pretraining annotations: [share-captioner_coco_lcs_sam_1246k_1107.json](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json)
- ShareGPT4V's SFT annotations: [sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json)

#### Organize Data
Organize the image files and annotation files as follows in `path/to/your/data`:
```none
data
├── llava
│   ├── llava_pretrain
│   │   ├── images
│   │   ├── blip_laion_cc_sbu_558k.json
├── coco
│   ├── train2017
├── sam
│   ├── images
├── gqa
│   ├── images
├── ocr_vqa
│   ├── images
├── textvqa
│   ├── train_images
├── vg
│   ├── VG_100K
│   ├── VG_100K_2
├── share_textvqa
│   ├── images
├── web-celebrity
│   ├── images
├── web-landmark
│   ├── images
├── wikiart
│   ├── images
├── text_files
│   ├── llava_v1_5_mix665k.json
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
```
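
Before starting a long training run, it can be worth checking that the layout above is complete. The following is a minimal sketch with a hypothetical helper name (`missing_entries`); the relative paths are taken from the directory tree, written with plain hyphens, and nothing here comes from the TinyLLaVABench scripts themselves.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED = [
    "llava/llava_pretrain/images",
    "llava/llava_pretrain/blip_laion_cc_sbu_558k.json",
    "coco/train2017",
    "sam/images",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
    "text_files/llava_v1_5_mix665k.json",
    "text_files/share-captioner_coco_lcs_sam_1246k_1107.json",
    "text_files/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]


def missing_entries(data_root):
    """Return the expected paths that do not exist under `data_root`."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Call it with the `data` directory itself, for example `missing_entries("path/to/your/data/data")`; an empty list means every expected file and folder was found.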

### Train
In this section, we describe the base recipe.
#### Hyperparameters
The hyperparameters used in both pretraining and finetuning are provided below.

Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------| ---: | ---: | ---: | ---: | ---: |
| TinyLLaVA-3.1B | 256 | 1e-3 | 1 | 3072 | 0 |

Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|----------------| ---: | ---: | ---: | ---: | ---: |
| TinyLLaVA-3.1B | 128 | 2e-5 | 1 | 3072 | 0 |
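
The global batch size is the product of the per-device batch size, the number of devices, and the number of gradient accumulation steps. If your hardware differs from the authors', you can keep the global batch size fixed and solve for the accumulation steps. The sketch below shows that arithmetic; the helper name, the device count, and the per-device sizes are illustrative assumptions, and only the global batch sizes 256 and 128 come from the recipe.

```python
def grad_accum_steps(global_batch_size, per_device_batch_size, num_devices):
    """Solve global = per_device * num_devices * accumulation for accumulation."""
    per_step = per_device_batch_size * num_devices
    if global_batch_size % per_step != 0:
        raise ValueError(
            "global batch size must be a multiple of per_device * num_devices"
        )
    return global_batch_size // per_step


# Global batch sizes come from the recipe; 8 devices and the per-device
# sizes are made-up values for illustration.
print(grad_accum_steps(256, per_device_batch_size=32, num_devices=8))  # 1
print(grad_accum_steps(128, per_device_batch_size=4, num_devices=8))   # 4
```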

#### Pretrain
- Replace the paths in the script with your own paths.
- Training script with DeepSpeed ZeRO-2: pretrain.sh.

#### Finetune
- Replace the paths in the script with your own paths.
- Training script with DeepSpeed ZeRO-3: finetune.sh.

#### Custom-Finetune
Check out our custom finetune using LoRA here.

- Prompt Template

The model supports multi-image and multi-prompt generation. When using the model, make sure to follow the correct prompt template (`USER: <image>xxx\nASSISTANT:`), where the `<image>` token is a placeholder special token for image embeddings.
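
A small helper keeps the template consistent across calls. This is a sketch that follows the template string literally for a single image and a single user turn; `build_prompt` and `IMAGE_TOKEN` are hypothetical names, and the multi-image layout is not covered because the text above does not spell it out.

```python
IMAGE_TOKEN = "<image>"


def build_prompt(user_text):
    """Format one user turn as USER: <image>{user_text}, a newline, then ASSISTANT:."""
    return f"USER: {IMAGE_TOKEN}{user_text}\nASSISTANT:"


print(repr(build_prompt("\nWhat is shown in this picture?")))
# 'USER: <image>\nWhat is shown in this picture?\nASSISTANT:'
```

The user text is inserted exactly where `xxx` appears in the template, so any leading newline after the image token has to be part of `user_text`.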

Model Inference from `pipeline` and `transformers`

- Using `pipeline`:

Below we used the ["bczhou/tiny-llava-v1-hf"](https://huggingface.co/bczhou/tiny-llava-v1-hf) checkpoint.

from transformers import pipeline
from PIL import Image
import requests
model_id = "bczhou/tiny-llava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentat

🔧 Technical Details

The technical details are reflected in the model's architecture, hyperparameters, and training methods described above.

📄 License

This project is licensed under the Apache-2.0 license.

📢 News

- [2024.03.10] Base recipe out!
- [2024.03.10] Finetune scripts out!
- [2024.02.25] Update evaluation scripts and docs!
- [2024.02.25] Data descriptions out. Release TinyLLaVA-1.5B and TinyLLaVA-2.0B!
- [2024.02.24] Example code on inference and model loading added!
- [2024.02.23] Evaluation code and scripts released!
- [2024.02.21] Creating the TinyLLaVABench repository on GitHub!
- [2024.02.21] Our paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models is out!
- [2024.01.11] Our first model [TinyLLaVA-1.4B](https://huggingface.co/bczhou/tiny-llava-v1-hf) is out!

📋 TODO

- [ ] Add support for Ollama and llama.cpp.
- [x] Developers' guide / How to build demo locally.
- [x] Training and custom finetuning docs.
- [x] Model Zoo descriptions.
- [x] Examples and inference.
- [x] Release code for training.
- [x] Add descriptions for evaluation.
- [x] Add descriptions for data preparation.
- [x] Release TinyLLaVA-1.5B and TinyLLaVA-2.0B.
- [x] Release TinyLLaVA-3.1B.
- [x] Release the evaluation code and weights today (2024.2.23).
