nanoLLaVA: Open-Source Vision-Language Model Designed to Run Efficiently on Edge Devices

nanoLLaVA

Developed by qnguyen3

nanoLLaVA is a 1B-parameter vision-language model designed to run efficiently on edge devices.

Image-Text-to-Text

Transformers

English

Open Source License: Apache-2.0

#Edge Device Visual Question Answering #Lightweight Multimodal #Efficient Vision-Language Model

Downloads 2,851

Release Date: 4/4/2024

Model Overview

nanoLLaVA is a compact vision-language model built on the Qwen1.5-0.5B language model and the SigLIP vision encoder, suitable for multimodal tasks that combine images and text.

Model Features

Efficient Edge Computing

Designed to run efficiently on edge devices, pairing a small parameter count (about 1B) with strong performance for its size.

Multimodal Capabilities

Combines visual and language understanding abilities to handle joint tasks involving images and text.

Improved Version

nanoLLaVA-1.5 has been released with significantly improved performance.

Model Capabilities

Visual Question Answering

Image Caption Generation

Multimodal Understanding

Text Generation

Image Analysis

Use Cases

Smart Assistants

Image Content Description

Generates detailed descriptions based on user-provided images

Accurately identifies content and contextual relationships within images

Education

Scientific Question Answering

Answers science-related questions involving images

Achieves 58.97% accuracy on the ScienceQA dataset

🚀 nanoLLaVA - Sub 1B Vision-Language Model

nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.

IMPORTANT: nanoLLaVA-1.5 is out with much better performance.

🚀 Quick Start

You can use nanoLLaVA with transformers using the following steps. First, install the necessary libraries:

pip install -U transformers accelerate flash_attn

Then, use the following Python script to interact with the model:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image: sample images can be found in the images folder
# converting to RGB guards against RGBA or palette-mode files
image = Image.open('/path/to/image.png').convert('RGB')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
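
The script above builds input_ids by tokenizing the text on either side of the <image> tag and joining the two halves with the placeholder id -200, which the model's remote code later replaces with image features. The sketch below isolates that splicing step so it can be inspected without loading the model. The helper names splice_image_token and toy_encode are hypothetical, and toy_encode is only a stand-in for tokenizer(chunk).input_ids; the single-image check is an assumption based on the script, which handles exactly one tag.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the Quick Start script


def splice_image_token(prompt, encode, image_tag="<image>",
                       image_token_index=IMAGE_TOKEN_INDEX):
    """Tokenize the text around a single image tag and join it with the placeholder id."""
    before, sep, after = prompt.partition(image_tag)
    # Assumption: exactly one image per prompt, as in the Quick Start script.
    if not sep or image_tag in after:
        raise ValueError("expected exactly one image tag in the prompt")
    return encode(before) + [image_token_index] + encode(after)


def toy_encode(text):
    # Hypothetical stand-in for tokenizer(text).input_ids: one id per character.
    return [ord(ch) for ch in text]


ids = splice_image_token("ab<image>cd", toy_encode)
print(ids)  # [97, 98, -200, 99, 100]
```

With the real tokenizer, passing lambda chunk: tokenizer(chunk).input_ids as encode should produce the same list that the script wraps in torch.tensor.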

✨ Features

nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.
Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
Vision Encoder: google/siglip-so400m-patch14-384

Benchmark	VQA v2	TextVQA	ScienceQA	POPE	MMMU (Test)	MMMU (Eval)	GQA	MM-VET
Score	70.84	46.71	58.97	84.1	28.6	30.4	54.79	23.9

📚 Documentation

Prompt Format

The model follows the ChatML standard, but without a \n after <|im_end|>:

<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
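
In practice tokenizer.apply_chat_template produces this string, as in the Quick Start script. The following hand-written sketch only illustrates the layout, in particular that nothing is emitted between <|im_end|> and the next <|im_start|>. The function name build_prompt is hypothetical, and the newline after the final "assistant" header is an assumption carried over from standard ChatML.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Render messages in the ChatML variant described above."""
    parts = []
    for message in messages:
        # Unlike standard ChatML, no "\n" follows <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>"
        )
    if add_generation_prompt:
        # Assumption: the assistant header ends with a newline, as in ChatML.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Answer the question"},
    {"role": "user", "content": "<image>\nWhat is the picture about?"},
]
prompt = build_prompt(messages)
print(prompt)
```

Printing the result should reproduce the four-line example shown above.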

Example

Question: What is the text saying?
Answer: "Small but mighty".

Question: How does the text correlate to the context of the image?
Answer: The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar.

📄 License

This project is licensed under the Apache-2.0 license.

The model was trained using a modified version of Bunny.

Training Data

The training data will be released later, as I am still writing a paper on this work. Expect the final version to be much more powerful than the current one.

Finetuning Code

Coming soon.
