nanoLLaVA-1.5 Open-Source Vision-Language Model - Compact and Powerful, Suitable for Free Deployment on Edge Devices

nanoLLaVA-1.5

Developed by qnguyen3

nanoLLaVA-1.5 is a vision-language model with under 1 billion parameters, designed specifically for edge devices—compact yet powerful.

Image-to-Text

Transformers

Language: English

License: Apache-2.0

Tags: #Edge Device Vision-Language #Lightweight Multimodal #Efficient Visual Question Answering

Downloads 442

Release Date: 6/29/2024

Model Overview

nanoLLaVA-1.5 is an upgrade of nanoLLaVA v1.0. It is an efficient vision-language model suited to image-text-to-text tasks.

Model Features

Compact yet Powerful

Designed for edge devices with under 1 billion parameters, yet highly capable.

Multimodal Support

Supports multimodal tasks involving vision and language.

Efficient Inference

Optimized to run efficiently even on edge devices.

Model Capabilities

Image caption generation

Visual question answering

Multimodal reasoning

Use Cases

Visual Question Answering

Image content description

Generate detailed textual descriptions based on images.

Education

Scientific question answering

Answer scientific questions based on images.

🚀 nanoLLaVA-1.5 - Improved sub 1B Vision-Language Model

nanoLLaVA-1.5 is a "small but mighty" sub-1B vision-language model designed to run efficiently on edge devices.


🚀 Quick Start

nanoLLaVA-1.5 is an update of the v1.0 model, qnguyen3/nanoLLaVA.

✨ Features

Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
Vision Encoder: google/siglip-so400m-patch14-384

Model	VQA v2	TextVQA	ScienceQA	POPE	MMMU (Test)	MMMU (Eval)	GQA	MM-VET
nanoLLaVA-1.0	70.84	46.71	58.97	84.1	28.6	30.4	54.79	23.9
nanoLLaVA-1.5	TBD	TBD	TBD	TBD	TBD	TBD	TBD	TBD

📦 Installation

The model runs with transformers. Install the required packages first:

pip install -U transformers accelerate flash_attn
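
Before running the usage script, it can help to confirm that everything it imports is actually available. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, not part of the model's code. The default list assumes the import names `torch` and `PIL`, which the usage script imports but the pip command above does not name explicitly.

```python
from importlib.util import find_spec

def missing_packages(names=("transformers", "accelerate", "flash_attn", "torch", "PIL")):
    """Return the import names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty list means every package can be imported; it does not check versions or GPU support.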

💻 Usage Examples

Basic Usage

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'qnguyen3/nanoLLaVA-1.5'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

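# tokenize the text on either side of the <image> tag, then join the two
# halves with -200, the placeholder id marking where the image goes
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)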

# load the image (sample images can be found in the images folder)
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
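
The tokenization step in the script above splits the prompt at the <image> tag and joins the pieces with the placeholder id -200. The sketch below wraps that splice in a reusable function. `splice_image_tokens` and `toy_encode` are hypothetical names used for illustration only. Handling more than one <image> tag this way is an assumption; the source only shows the single-image case.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id taken from the usage script

def splice_image_tokens(text, encode, image_token="<image>"):
    """Encode `text` chunk by chunk, inserting IMAGE_TOKEN_INDEX at each image tag."""
    chunks = [encode(chunk) for chunk in text.split(image_token)]
    ids = list(chunks[0])
    for chunk in chunks[1:]:
        ids.append(IMAGE_TOKEN_INDEX)
        ids.extend(chunk)
    return ids

def toy_encode(s):
    """Toy encoder for illustration only: one id per character."""
    return [ord(c) for c in s]

ids = splice_image_tokens("a<image>b", toy_encode)
print(ids)  # [97, -200, 98]
```

With the real tokenizer, `encode` would be something like `lambda s: tokenizer(s).input_ids`, and the resulting list would be wrapped in `torch.tensor(...).unsqueeze(0)` as in the script.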

📚 Documentation

Prompt Format

The model follows the ChatML standard, but without a \n after <|im_end|>:

<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
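
The format above can also be assembled by hand, for example to check what a chat template produces. The following is a minimal sketch; `build_chatml_prompt` is a hypothetical helper, not part of the model's code, and the trailing newline after the assistant tag is an assumption that should be checked against the output of tokenizer.apply_chat_template.

```python
def build_chatml_prompt(question, system="Answer the question", with_image=True):
    """Build a ChatML-style prompt with no newline after <|im_end|>."""
    user_content = f"<image>\n{question}" if with_image else question
    return (
        f"<|im_start|>system\n{system}<|im_end|>"
        f"<|im_start|>user\n{user_content}<|im_end|>"
        "<|im_start|>assistant\n"
    )

prompt_text = build_chatml_prompt("What is the picture about?")
print(prompt_text)
```

Each turn ends with <|im_end|> and the next <|im_start|> follows immediately, which is the one departure from standard ChatML described above.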

The model was trained using a modified version of Bunny.

📄 License

This project is licensed under the Apache-2.0 license.
