nanoLLaVA: an open-source vision-language model built to run efficiently on edge devices

nanoLLaVA

Developed by qnguyen3

nanoLLaVA is a sub-1B-parameter vision-language model designed to run efficiently on edge devices.

Image-Text-to-Text

Transformers

Language: English | License: Apache-2.0 | #EdgeDeviceVQA #LightweightMultimodal #EfficientVisionLanguageModel

Downloads: 2,851

Released: 4/4/2024

Model overview

nanoLLaVA is a small vision-language model built on Qwen1.5-0.5B and the SigLIP vision encoder, and it is suited to multimodal tasks.

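Model features

Efficient edge computing

Designed to run efficiently on edge devices: small in parameter count yet capable.

Multimodal capability

Combines visual and language understanding to handle tasks that involve both images and text.

Improved version

nanoLLaVA-1.5 has been released with substantially improved performance.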

Model capabilities

Visual question answering

Image captioning

Multimodal understanding

Text generation

Image analysis

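Use cases

Intelligent assistants

Image content description

Generates a detailed description of an image supplied by the user

Identifies the objects in an image and how they relate in context

Education

Science question answering

Answers science questions that refer to an image

Reaches 58.97% accuracy on the ScienceQA dataset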

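🚀 nanoLLaVA - a sub-1B vision-language model

nanoLLaVA is a "small but mighty" vision-language model with fewer than 1B parameters, designed to run efficiently on edge devices. It targets edge computing scenarios, where vision and language processing must fit within limited resources.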

🚀 Quick start

You can run the model with the transformers library using the following script:

pip install -U transformers accelerate flash_attn

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

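# split the prompt at the <image> tag and tokenize each side separately
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
# -200 is the placeholder id that marks the image position in the token sequence
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)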

# image, sample images can be found in images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
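
The tokenization step in the script above splits the chat text at the <image> tag and joins the tokenized pieces with a placeholder id. A minimal sketch of that step as a standalone function follows. The names build_input_ids and fake_tokenize are hypothetical and introduced only for illustration, and the -200 placeholder is copied from the script rather than from any documented API. Unlike the script, the sketch accepts any number of <image> tags.

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # placeholder id copied from the quick-start script


def build_input_ids(text: str,
                    tokenize: Callable[[str], List[int]],
                    image_token: str = "<image>") -> List[int]:
    """Tokenize the text around each image tag and join the chunks with the placeholder id."""
    chunks = [tokenize(chunk) for chunk in text.split(image_token)]
    input_ids: List[int] = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # one placeholder between every pair of neighbouring text chunks
            input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(chunk)
    return input_ids


def fake_tokenize(s: str) -> List[int]:
    """Toy stand-in for a real tokenizer: one id per character (illustration only)."""
    return [ord(c) for c in s]


ids = build_input_ids("ab<image>\ncd", fake_tokenize)
print(ids)
```

With the real tokenizer from the script, the second argument could be lambda s: tokenizer(s).input_ids. This assumes the tokenizer adds no extra special tokens to each chunk.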

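✨ Key features

Lightweight and efficient: with fewer than 1B parameters, the model runs efficiently on edge devices and lowers hardware requirements.
Benchmarked on multiple datasets: scores 70.84 on VQA v2 and 84.1 on POPE, among other visual question answering and multimodal benchmarks.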
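
To make the hardware claim concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes an upper bound of 1B parameters (the exact count is not stated here) and ignores activations, the KV cache, and framework overhead. The function name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory for the weights only, in GiB (default 2 bytes per parameter, i.e. float16)."""
    return num_params * bytes_per_param / 2**30


# Assumption: 1B parameters as an upper bound for a "sub-1B" model.
upper_bound_params = 1_000_000_000

fp16_gib = weight_memory_gib(upper_bound_params)      # float16, as in the quick-start script
fp32_gib = weight_memory_gib(upper_bound_params, 4)   # float32, for comparison

print(f"float16: under {fp16_gib:.2f} GiB, float32: under {fp32_gib:.2f} GiB")
```

Under these assumptions the float16 weights fit in less than 2 GiB, which is why half precision is a natural choice for constrained devices.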

📦 Installation

Install the required dependencies with the following command:

pip install -U transformers accelerate flash_attn

📚 Documentation

Model information

Property	Details
Base LLM	Quyen-SE-v0.1 (Qwen1.5-0.5B)
Vision encoder	google/siglip-so400m-patch14-384
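
The vision encoder name suggests a 14-pixel patch size and a 384-pixel input resolution. Under that reading, and assuming non-overlapping patches with any remainder dropped and one token per patch, the number of image patches can be worked out as below. The function name num_image_patches is hypothetical, and how many of these tokens nanoLLaVA passes to the language model is not stated here.

```python
def num_image_patches(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image (remainder pixels are dropped)."""
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: "patch14-384" means patch size 14 and input resolution 384.
patches = num_image_patches(384, 14)
print(patches)  # 27 patches per side, 27 * 27 in total
```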

Model performance

Benchmark	VQA v2	TextVQA	ScienceQA	POPE	MMMU (Test)	MMMU (Eval)	GQA	MM-VET
Score	70.84	46.71	58.97	84.1	28.6	30.4	54.79	23.9

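Training data

The training data will be released later, as the author is still writing the accompanying paper. The final version is expected to be stronger than the current one.

Fine-tuning code

Coming soon!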

Prompt format

The model follows the ChatML standard, but with no \n after <|im_end|>:

<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
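
As an illustration of this format, the sketch below assembles the same prompt by hand. The function name format_prompt is hypothetical, and the newline after the final assistant tag is an assumption. In practice the tokenizer's own apply_chat_template, as used in the quick-start script, remains the source of truth.

```python
from typing import Dict, List


def format_prompt(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Join messages in ChatML style with no newline after <|im_end|>."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )
    if add_generation_prompt:
        # Assumption: the assistant turn is opened with a trailing newline.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "Answer the question"},
    {"role": "user", "content": "<image>\nWhat is the picture about?"},
]
prompt = format_prompt(messages)
print(prompt)
```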

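Image	Example
	What does the text in the image say? "Small but mighty". How does the text relate to the context of the image? The text appears to be a playful or humorous description of a small but mighty figure, possibly a mouse or a toy mouse, holding a barbell.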