nanoLLaVA: an open-source vision-language model built for efficient inference on edge devices

nanoLLaVA

Developed by qnguyen3

nanoLLaVA is a 1B-parameter vision-language model designed to run efficiently on edge devices.

Image-to-text

Transformers

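Language: English

License: Apache-2.0 (open source)

Tags: #visual question answering for edge devices #lightweight multimodal #efficient vision-language model

Downloads: 2,851

Release date: April 4, 2024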

Model overview

nanoLLaVA is a small yet capable vision-language model built on Qwen1.5-0.5B and a SigLIP vision encoder. It is suited to multimodal tasks.

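Model features

Efficient edge computing

Designed to run efficiently on edge devices, with results that are competitive for its small parameter count (see the evaluation scores below).

Multimodal capability

Combines visual and language understanding to handle tasks that involve both images and text.

Improved version

nanoLLaVA-1.5 has been released with substantially improved performance.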

Model capabilities

Visual question answering

Image captioning

Multimodal understanding

Text generation

Image analysis

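Use cases

Smart assistants

Describing image content

Generates a detailed description of an image supplied by the user

Identifies what an image contains and how its elements relate in context

Education

Answering science questions

Answers science questions about an image

Achieves 58.97% accuracy on the ScienceQA dataset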

🚀 nanoLLaVA - Sub-1B Vision-Language Model

nanoLLaVA is a "small but mighty" 1B-parameter vision-language model designed to run efficiently on edge devices.

🚀 Quick start

Important notice

nanoLLaVA-1.5 has been released with substantially improved performance.

Installation

pip install -U transformers accelerate flash_attn
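
Before loading the model, it can help to confirm that the three packages are actually importable, since flash_attn in particular may fail to build on machines without a CUDA toolchain. The following is a minimal sketch using only the Python standard library; the helper name missing_packages is hypothetical and is not part of any of these libraries.

```python
import importlib.util

# Import names of the packages installed by the pip command above.
REQUIRED = ("transformers", "accelerate", "flash_attn")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks the packages up; it does not import them, so it stays fast and does not touch the GPU.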

Usage

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# Silence some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set the device
torch.set_default_device('cuda')  # or 'cpu'

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# Text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# Image; sample images can be found in the images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
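
The usage snippet splits the prompt at `<image>`, tokenizes each piece, and puts the id -200 between them. That id appears to act as a placeholder marking where the image features go (an assumption based on the LLaVA convention; the source does not spell it out). The sketch below isolates that splicing step with plain Python lists. The function names and the character-level toy tokenizer are hypothetical stand-ins so the example runs without downloading the model.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the usage snippet


def toy_tokenize(text):
    """Hypothetical stand-in for tokenizer(text).input_ids: one id per character."""
    return [ord(ch) for ch in text]


def splice_image_tokens(text, tokenize=toy_tokenize, marker="<image>"):
    """Tokenize the text around each marker and put the placeholder id between chunks."""
    chunks = [tokenize(chunk) for chunk in text.split(marker)]
    input_ids = list(chunks[0])
    for chunk in chunks[1:]:
        input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(chunk)
    return input_ids


ids = splice_image_tokens("ab<image>cd")
print(ids)  # [97, 98, -200, 99, 100]
```

With the real tokenizer, one would pass `lambda s: tokenizer(s).input_ids` as `tokenize` and wrap the result in `torch.tensor(...).unsqueeze(0)`, as the usage snippet does. For a prompt with exactly one `<image>` marker the result matches `text_chunks[0] + [-200] + text_chunks[1]`. Whether the model accepts more than one image per prompt is not stated in the source.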

✨ Key features

Model composition

Base LLM: Quyen-SE-v0.1 (Qwen1.5-0.5B)
Vision encoder: google/siglip-so400m-patch14-384

Evaluation scores

Benchmark	VQA v2	TextVQA	ScienceQA	POPE	MMMU (Test)	MMMU (Eval)	GQA	MM-VET
Score	70.84	46.71	58.97	84.1	28.6	30.4	54.79	23.9

📚 Documentation

Prompt format

The model follows the ChatML standard, except that there is no \n after <|im_end|>.

<|im_start|>system
Answer the question<|im_end|><|im_start|>user
<image>
What is the picture about?<|im_end|><|im_start|>assistant
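
To make the missing newline concrete, the following sketch assembles the same prompt by hand. The function name build_chatml_prompt is hypothetical, and the trailing newline after the final `assistant` header is an assumption. In practice `tokenizer.apply_chat_template` from the usage section is the safer choice, because it uses the template shipped with the model.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Join ChatML turns with no newline after <|im_end|>."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    prompt = "".join(parts)
    if add_generation_prompt:
        # Assumption: the assistant header ends with a newline.
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "Answer the question"},
    {"role": "user", "content": "<image>\nWhat is the picture about?"},
]
print(build_chatml_prompt(messages))
```

Each turn ends directly at `<|im_end|>`, so the next `<|im_start|>` follows on the same line, which matches the format shown above.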

Image and example

Example (the accompanying image is not reproduced here):

Q: What is the text saying?
A: "Small but mighty".

Q: How does the text correlate to the context of the image?
A: The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar.