---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE
language:
- ja
- en
tags:
- vila
- nvila
- conversational
- multimodal
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
pipeline_tag: image-text-to-text
---
# Heron-NVILA-Lite-1B

Heron-NVILA-Lite-1B is a vision language model trained for Japanese, based on the NVILA-Lite architecture.
## Model Overview

- Base LLM: Qwen/Qwen2.5-0.5B-Instruct
- Vision encoder: Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
- Languages: Japanese, English
## Setup

```bash
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
```
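Before loading the model, it can help to confirm the environment. The quick check below is only a suggested sketch, not part of the official setup; it assumes `torch` was pulled in as a dependency of `torchvision`:

```python
import torch
import transformers

# The usage snippets below were written against transformers 4.45.0.
print("transformers:", transformers.__version__)

# device_map="auto" places the model on a GPU when one is visible.
print("CUDA available:", torch.cuda.is_available())
```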
## Usage

```python
from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-1B"

# you can build the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# or load the pretrained weights directly
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# show the default chat template
print(model.tokenizer.chat_template)

# generate from text only ("こんにちは" = "Hello")
response = model.generate_content(["こんにちは"])
print(response)
print("---" * 40)
```
```python
from PIL import Image
import requests

# generate from a single image plus a prompt
# ("画像を説明してください。" = "Please describe the image.")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])
print(response)
print("---" * 40)
```
```python
from PIL import Image
import requests
from transformers import GenerationConfig

# customize decoding with a GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config,
)
print(response)
print("---" * 40)
```
```python
from PIL import Image
import requests

# generate from multiple interleaved images and text
# ("これは日本の画像です" = "This is an image of Japan",
#  "これはオーストリアの画像です" = "This is an image of Austria",
#  "各画像の違いを説明して" = "Explain the differences between the images")
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",
    images[1],
    "これはオーストリアの画像です",
    "各画像の違いを説明して",
])
print(response)
print("---" * 40)
```
## Training Summary

| Stage | Training | Data Sources | Samples |
|-------|----------|--------------|---------|
| Stage1 | Projector | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
| Stage2 | Projector, LLM | Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
|        |                | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
| Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.1M |
## Evaluation

I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard (as of March 2025) and the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with LLM-as-a-judge using "gpt-4o-2024-05-13". Sarashina2-Vision-14B was evaluated on its official blog using "gpt-4o-2024-08-06"; because the evaluation conditions differ, the results for Sarashina2-Vision-14B should be treated as reference only.
## Risks and Limitations
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.
## License

The model weights are released under the Apache License 2.0.
## Acknowledgements
This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
I would like to acknowledge the use of the following open-source repositories: