Heron-NVILA-Lite-2B: Open-Source Vision Language Model with Free Japanese and English Image-Text Interaction

Heron NVILA Lite 2B

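Developed by turing-motors

Heron-NVILA-Lite-2B is a vision language model based on the NVILA-Lite architecture and trained for Japanese. It supports image-text interaction tasks in Japanese and English.

Image-to-Text

Safetensors

Supports multiple languages
License: Apache-2.0
#Japanese-visual-dialogue #lightweight-multimodal #image-text-instruction-understanding

Downloads: 1,023

Released: March 21, 2025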

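Model Overview

The model combines a vision encoder with a large language model to handle joint image-text tasks such as image captioning and visual question answering.

Model Features

Multilingual support

Optimized specifically for Japanese, with support for English vision-language tasks as well

Efficient architecture

Uses the lightweight NVILA-Lite architecture to balance performance and efficiency

Multimodal understanding

Processes image and text inputs together and relates the two

Model Capabilities

Image captioning

Visual question answering

Interleaved multi-image dialogue

Multilingual text generation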

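Use Cases

Content understanding

Image description

Generates a detailed text description of an input image

Describes the main content and scene in the image

Intelligent interaction

Visual question answering

Answers natural-language questions about image content

Interprets the image content and gives relevant answers

Multi-turn dialogue

Multi-image comparison

Analyzes similarities and differences across multiple images

Compares the features of different images and points out the differences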

🚀 Heron-NVILA-Lite-2B

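Heron-NVILA-Lite-2B is a vision language model based on the NVILA-Lite architecture and trained for Japanese. It takes image and text inputs and produces text output, which suits it to multimodal interaction scenarios.

✨ Key Features

Multilingual support: Japanese and English.
Multimodal processing: accepts image and text inputs for image-text interaction.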

📦 Installation

# I have confirmed that 4.46.0 and 4.49.0 also work. Other versions of Transformers may work as well, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
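
Because only a few Transformers versions are reported as tested, it can help to check the installed version before loading the model. The following is a minimal sketch using the standard library; the helper names `is_tested_version` and `check_transformers` are hypothetical and not part of the model's code.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions the installation notes report as working:
# 4.45.0 is pinned in the pip command; 4.46.0 and 4.49.0 are confirmed by the author.
TESTED_TRANSFORMERS_VERSIONS = ("4.45.0", "4.46.0", "4.49.0")


def is_tested_version(installed: str) -> bool:
    """Return True if `installed` exactly matches one of the tested versions."""
    return installed in TESTED_TRANSFORMERS_VERSIONS


def check_transformers() -> str:
    """Describe the installed transformers version (hypothetical helper)."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if is_tested_version(installed):
        return f"transformers {installed} is a tested version"
    return f"transformers {installed} is untested with this model"


if __name__ == "__main__":
    print(check_transformers())
```

An untested version is not necessarily broken; the check only flags that the combination has not been confirmed.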

💻 Usage Examples

Basic Usage

from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-2B"

# You can build the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# Or load it directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Show the chat template
print(model.tokenizer.chat_template)

# Example: text-only generation
response = model.generate_content(["こんにちは"])
print(response)
print("---" * 40)

Advanced Usage

Example: text + image generation

from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])
print(response)
print("---" * 40)

Example: generation with a GenerationConfig

from PIL import Image
import requests
from transformers import GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config
)
print(response)
print("---" * 40)

Example: multi-image + text generation

from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
   Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",
    images[1],
    "これはオーストリアの画像です",
    "各画像の違いを説明して"])
print(response)
print("---" * 40)
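
The multi-image example passes one flat list in which each image is followed by its caption and the question comes last. When the number of images varies, that list can be assembled programmatically. The sketch below assumes `generate_content` accepts any such flat list, as the example above suggests; `build_interleaved_content` is a hypothetical helper, not part of the model's API.

```python
from typing import Any, List, Sequence, Tuple


def build_interleaved_content(
    captioned_images: Sequence[Tuple[Any, str]], question: str
) -> List[Any]:
    """Flatten (image, caption) pairs into [image, caption, ..., question]."""
    content: List[Any] = []
    for image, caption in captioned_images:
        content.append(image)    # the image comes first...
        content.append(caption)  # ...followed by the text that describes it
    content.append(question)     # the final instruction goes at the end
    return content


# Usage with the objects from the example above (requires the loaded model):
# pairs = list(zip(images, captions))  # captions: one string per image
# response = model.generate_content(build_interleaved_content(pairs, question))
```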

📚 Documentation

Model Overview

Property	Details
Developer	Turing Inc.
Vision encoder	paligemma-siglip-so400m-patch14-448
Projector	mlp_downsample_2x2_fix
LLM	Qwen2.5-1.5B-Instruct
Supported languages	Japanese, English
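
As a rough sanity check on the components listed above, the encoder name suggests 14-pixel patches on a 448×448 input, and the projector name suggests a 2×2 spatial downsample. The sketch below works out the visual token count under those readings. Both readings are assumptions inferred from the names, not documented behavior, and the function name is hypothetical. It also ignores any multi-scale or tiling step (the installation instructions pull in scaling_on_scales), so the real count per image may differ.

```python
def visual_tokens_per_tile(
    image_size: int = 448,  # assumed from "...-448" in the encoder name
    patch_size: int = 14,   # assumed from "patch14" in the encoder name
    downsample: int = 2,    # assumed from "downsample_2x2" in the projector name
) -> int:
    """Estimate the tokens the LLM receives for one square input tile."""
    patches_per_side = image_size // patch_size
    # Ceiling division: assume an odd grid would be padded, not truncated.
    tokens_per_side = -(-patches_per_side // downsample)
    return tokens_per_side ** 2


if __name__ == "__main__":
    print(visual_tokens_per_tile())
```

Under these assumptions a single tile yields a 32×32 patch grid that the projector reduces to 16×16 tokens.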

Training Summary

Stage	Trained components	Data sources	Samples
Stage 1	Projector	Japanese image-text pairs, LLaVA-Pretrain	1.1M
Stage 2	Projector, LLM	Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05)	13M
Stage 2	Projector, LLM	Japanese image-text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions	20M
Stage 3	Vision encoder, Projector, LLM	llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock	1.1M
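
The staged schedule in the table can be written down as a small data structure, which makes it easy to see which components are trained at each stage and how many samples are used overall. This is an illustrative sketch only: the component identifiers and helper names are hypothetical, it assumes that components not listed for a stage stay frozen, and it assumes the two Stage 2 sample counts in the table (13M and 20M) are additive.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical identifiers for the three components named in the table.
ALL_COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]
    samples: int


STAGES = (
    Stage("Stage 1", ("projector",), 1_100_000),
    # Assumption: the 13M and 20M figures for Stage 2 add up.
    Stage("Stage 2", ("projector", "llm"), 13_000_000 + 20_000_000),
    Stage("Stage 3", ("vision_encoder", "projector", "llm"), 1_100_000),
)


def frozen_components(stage: Stage) -> Tuple[str, ...]:
    """Components not trained in `stage` (assumed frozen)."""
    return tuple(c for c in ALL_COMPONENTS if c not in stage.trainable)


def total_samples() -> int:
    """Total number of training samples across all stages."""
    return sum(stage.samples for stage in STAGES)


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "trains", stage.trainable, "freezes", frozen_components(stage))
    print("total samples:", total_samples())
```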

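Evaluation

Evaluation was carried out with llm-jp-eval-mm. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B are taken from the llm-jp-eval-mm leaderboard as of March 2025 and from the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with "gpt-4o-2024-05-13" as the judge model. The official Sarashina2-Vision-14B blog post reports scores evaluated with "gpt-4o-2024-08-06"; because the evaluation conditions differ, treat the Sarashina2-Vision-14B results here as reference only.

Model	LLM size	Heron-Bench overall LLM score (%)	JA-VLM-Bench-In-the-Wild LLM score (out of 5.0)	JA-VG-VQA-500 LLM score (out of 5.0)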
Heron-NVILA-Lite-1B	0.5B	45.9	2.92	3.16
Heron-NVILA-Lite-2B	1.5B	52.8	3.52	3.50
Heron-NVILA-Lite-15B	14B	59.6	4.2	3.82
LLaVA-CALM2-SigLIP	7B	43.3	3.15	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	2.92	2.96
VILA-jp	13B	57.2	3.69	3.62
Asagi-14B	13B	55.8	3.44	3.84
Sarashina2-Vision-14B	13B	50.9	4.1	3.43
Qwen2-VL 7B Instruct	7B	55.5	3.61	3.6
GPT-4o	-	87.6	3.85	3.58
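
To read the table in terms of relative standing, the scores can be loaded into a small structure and compared directly. The sketch below copies the numbers from the table above and uses two hypothetical helpers: one ranks a model by Heron-Bench score, the other lists the models it outscores on all three benchmarks. It does not account for the differing evaluation conditions noted for Sarashina2-Vision-14B.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    model: str
    heron_bench: float  # Heron-Bench overall LLM score (%)
    in_the_wild: float  # JA-VLM-Bench-In-the-Wild LLM score (out of 5.0)
    vg_vqa: float       # JA-VG-VQA-500 LLM score (out of 5.0)


# Scores copied from the evaluation table.
ROWS = [
    Row("Heron-NVILA-Lite-1B", 45.9, 2.92, 3.16),
    Row("Heron-NVILA-Lite-2B", 52.8, 3.52, 3.50),
    Row("Heron-NVILA-Lite-15B", 59.6, 4.2, 3.82),
    Row("LLaVA-CALM2-SigLIP", 43.3, 3.15, 3.21),
    Row("Llama-3-EvoVLM-JP-v2", 39.3, 2.92, 2.96),
    Row("VILA-jp", 57.2, 3.69, 3.62),
    Row("Asagi-14B", 55.8, 3.44, 3.84),
    Row("Sarashina2-Vision-14B", 50.9, 4.1, 3.43),
    Row("Qwen2-VL 7B Instruct", 55.5, 3.61, 3.6),
    Row("GPT-4o", 87.6, 3.85, 3.58),
]


def heron_bench_rank(model: str) -> int:
    """1-based rank of `model` when sorting by Heron-Bench score, highest first."""
    ordered = sorted(ROWS, key=lambda r: r.heron_bench, reverse=True)
    return [r.model for r in ordered].index(model) + 1


def beaten_on_all_metrics(model: str) -> List[str]:
    """Models that score strictly lower than `model` on all three benchmarks."""
    ref = next(r for r in ROWS if r.model == model)
    return [
        r.model
        for r in ROWS
        if r.heron_bench < ref.heron_bench
        and r.in_the_wild < ref.in_the_wild
        and r.vg_vqa < ref.vg_vqa
    ]


if __name__ == "__main__":
    name = "Heron-NVILA-Lite-2B"
    print(name, "Heron-Bench rank:", heron_bench_rank(name), "of", len(ROWS))
    print("Outscored on all three benchmarks:", beaten_on_all_metrics(name))
```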