Heron-NVILA-Lite-1B open-source model: Japanese-English bilingual image-and-text interaction

Heron NVILA Lite 1B

Developed by turing-motors

A Japanese vision-language model trained on the NVILA-Lite architecture, supporting image-text interaction in Japanese and English.

Image-to-Text

Safetensors

Multilingual. Open-source license: Apache-2.0. Tags: #Japanese visual question answering #lightweight multimodal #conversational AI

Downloads: 460

Release date: 3/24/2025

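Model overview

Heron-NVILA-Lite-1B is a lightweight vision-language model that takes image and text input and generates natural-language responses. It is optimized primarily for Japanese and also supports English.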

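Model features

Lightweight architecture

An efficient 1B-parameter design that balances performance against compute requirements.

Multimodal understanding

Processes image and text input together and reasons about the relationship between them.

Japanese optimization

Trained and optimized specifically for Japanese-language use.

Conversational interaction

Supports multi-turn image-text dialogue while keeping context consistent across turns.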

Model capabilities

Image caption generation

Visual question answering

Multimodal dialogue

Cross-lingual understanding

Image content comparison

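Use cases

Intelligent customer support

Product image inquiries

Users upload a product image and receive product information and purchasing advice.

Educational support

Visual learning

Generates explanatory text from images in teaching materials.

Content moderation

Image content analysis

Identifies and describes sensitive content in images.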

🚀 Heron-NVILA-Lite-1B

Heron-NVILA-Lite-1B is a vision-language model based on the NVILA-Lite architecture and trained for Japanese. It handles a range of tasks that combine images and text, and supports both Japanese and English.

🚀 Quick start

Install the required libraries before using the model by running the following commands.

# I have confirmed that 4.46.0 and 4.49.0 also work. Other versions of Transformers may also work, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
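
Because only a few Transformers releases are reported as tested, it can help to check the installed version before loading the model. The following is a minimal sketch, not part of the model's own tooling: the helper names `is_tested_version` and `installed_transformers_version` are hypothetical, and the version list is simply the three releases named in the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions the setup comment above reports as confirmed to work.
TESTED_TRANSFORMERS = ("4.45.0", "4.46.0", "4.49.0")


def is_tested_version(installed: str) -> bool:
    """Return True if `installed` exactly matches one of the tested versions."""
    return installed in TESTED_TRANSFORMERS


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is missing."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


current = installed_transformers_version()
if current is None:
    print("transformers is not installed")
elif is_tested_version(current):
    print(f"transformers {current} is a tested version")
else:
    print(f"transformers {current} is untested with this model and may or may not work")
```

An untested version is not necessarily broken; the check only flags that it falls outside the releases the author verified.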

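✨ Main features

Multilingual: supports Japanese and English.
Multimodal processing: handles tasks that combine images and text.
Strong results for its size: on Heron-Bench it scores 45.9%, ahead of the larger LLaVA-CALM2-SigLIP (7B, 43.3%) and Llama-3-EvoVLM-JP-v2 (8B, 39.3%). See the evaluation table for the full comparison.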

💻 Usage examples

Basic usage

from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-1B"

# you can use config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# or directly from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# show chat_template
print(model.tokenizer.chat_template)

# generate from raw text only ("こんにちは" means "Hello")
response = model.generate_content(["こんにちは"])
print(response)
print("---" * 40)

Advanced usage

# generate from text + image
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# the prompt means "Please describe the image."
response = model.generate_content([image, "画像を説明してください。"])
print(response)
print("---" * 40)

# generate with an explicit generation_config
from PIL import Image
import requests
from transformers import GenerationConfig
# sample with temperature 0.5 and cap the response at 512 new tokens
generation_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.5,
    do_sample=True,
)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# the prompt means "Please describe the image."
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config
)
print(response)
print("---" * 40)

# generate from interleaved input: image + text + image + text + text
from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan"
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して"])  # "Describe the differences between the images"
print(response)
print("---" * 40)
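
The interleaved example above passes a flat list that alternates images and captions and ends with a question. When there are more than a couple of images, building that list by hand becomes error-prone. The sketch below shows one way to assemble it; `build_interleaved_content` is a hypothetical helper, not part of the model's API, and placeholder strings stand in for PIL images so the snippet runs without downloads.

```python
from typing import Any, List, Optional, Sequence


def build_interleaved_content(
    images: Sequence[Any],
    captions: Sequence[str],
    question: Optional[str] = None,
) -> List[Any]:
    """Return [image, caption, image, caption, ..., question] as one flat list."""
    if len(images) != len(captions):
        raise ValueError("images and captions must have the same length")
    content: List[Any] = []
    for image, caption in zip(images, captions):
        content.append(image)
        content.append(caption)
    if question is not None:
        content.append(question)
    return content


# Placeholder strings are used here instead of PIL.Image objects.
content = build_interleaved_content(
    ["<image 0>", "<image 1>"],
    ["これは日本の画像です", "これはオーストリアの画像です"],
    "各画像の違いを説明して",
)
print(content)
```

With real images, the returned list would be passed to `model.generate_content(...)` in the same way as the hand-written list in the example above.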

📚 Documentation

Model overview

Attribute	Details
Developer	Turing Inc.
Vision encoder	paligemma-siglip-so400m-patch14-448
Projector	mlp_downsample_2x2_fix
LLM	Qwen2.5-0.5B-Instruct
Supported languages	Japanese, English

Training summary

Stage	Trained components	Data sources	Number of samples
Stage 1	Projector	Japanese image text pairs, LLaVA-Pretrain	1.1M
Stage 2	Projector, LLM	Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05)	13M
		Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions	20M
Stage 3	Vision encoder, Projector, LLM	llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock	1.1M
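
To see how the training data is distributed across stages, the sample counts in the table can be added up. This is a rough sketch using the rounded figures from the table. It assumes the two Stage 2 rows are separate data mixtures whose counts add, and it says nothing about overlap between datasets.

```python
# Sample counts in millions, copied from the training summary table (rounded).
stage_samples_m = {
    "Stage 1": [1.1],
    "Stage 2": [13.0, 20.0],  # two data mixtures listed for the same stage
    "Stage 3": [1.1],
}

# Sum the mixtures within each stage, then across all stages.
per_stage = {stage: sum(parts) for stage, parts in stage_samples_m.items()}
total_m = sum(per_stage.values())

for stage, count in per_stage.items():
    print(f"{stage}: {count:.1f}M samples ({count / total_m:.1%} of total)")
print(f"Total: {total_m:.1f}M samples")
```

Under these assumptions Stage 2 supplies 33M of roughly 35.2M samples, about 94% of the total, while Stages 1 and 3 each contribute about 1.1M.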

Evaluation

This evaluation used llm-jp-eval-mm. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard and the Asagi website as of March 2025. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with llm-as-a-judge using "gpt-4o-2024-05-13". Sarashina2-Vision-14B's official blog evaluates it using "gpt-4o-2024-08-06" instead. Because the evaluation conditions differ, treat the Sarashina2-Vision-14B results as a reference only.

Model	LLM size	Heron-Bench overall LLM (%)	JA-VLM-Bench-In-the-Wild LLM (/5.0)	JA-VG-VQA-500 LLM (/5.0)
Heron-NVILA-Lite-1B	0.5B	45.9	2.92	3.16
Heron-NVILA-Lite-2B	1.5B	52.8	3.52	3.50
Heron-NVILA-Lite-15B	14B	59.6	4.2	3.82
LLaVA-CALM2-SigLIP	7B	43.3	3.15	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	2.92	2.96
VILA-jp	13B	57.2	3.69	3.62
Asagi-14B	13B	55.8	3.44	3.84
Sarashina2-Vision-14B	13B	50.9	4.1	3.43
Qwen2-VL 7B Instruct	7B	55.5	3.61	3.6
GPT-4o	-	87.6	3.85	3.58
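
The table can be queried directly to see where the 1B model stands against the other entries. The sketch below copies the scores from the table and lists, for each benchmark, the models that Heron-NVILA-Lite-1B strictly outscores. The helper name `models_outscored` is illustrative, and the comparison inherits the caveats above about differing evaluation conditions.

```python
# (model, LLM size, Heron-Bench %, JA-VLM-Bench-In-the-Wild /5.0, JA-VG-VQA-500 /5.0)
RESULTS = [
    ("Heron-NVILA-Lite-1B", "0.5B", 45.9, 2.92, 3.16),
    ("Heron-NVILA-Lite-2B", "1.5B", 52.8, 3.52, 3.50),
    ("Heron-NVILA-Lite-15B", "14B", 59.6, 4.2, 3.82),
    ("LLaVA-CALM2-SigLIP", "7B", 43.3, 3.15, 3.21),
    ("Llama-3-EvoVLM-JP-v2", "8B", 39.3, 2.92, 2.96),
    ("VILA-jp", "13B", 57.2, 3.69, 3.62),
    ("Asagi-14B", "13B", 55.8, 3.44, 3.84),
    ("Sarashina2-Vision-14B", "13B", 50.9, 4.1, 3.43),
    ("Qwen2-VL 7B Instruct", "7B", 55.5, 3.61, 3.6),
    ("GPT-4o", "-", 87.6, 3.85, 3.58),
]

# Column index of each benchmark score within a RESULTS row.
BENCHMARKS = {"Heron-Bench": 2, "JA-VLM-Bench-In-the-Wild": 3, "JA-VG-VQA-500": 4}
TARGET = "Heron-NVILA-Lite-1B"


def models_outscored(target: str, column: int) -> list:
    """Return the names of models whose score in `column` is strictly below the target's."""
    target_score = next(row[column] for row in RESULTS if row[0] == target)
    return [row[0] for row in RESULTS if row[column] < target_score]


for name, column in BENCHMARKS.items():
    print(f"{name}: {TARGET} outscores {models_outscored(TARGET, column)}")
```

Ties are not counted as wins, so the equal 2.92 scores of Heron-NVILA-Lite-1B and Llama-3-EvoVLM-JP-v2 on JA-VLM-Bench-In-the-Wild do not appear in the output.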