# 🚀 Heron-NVILA-Lite-2B
Heron-NVILA-Lite-2B is a vision language model trained for Japanese, based on the NVILA-Lite architecture. It enables multimodal interactions, supporting both image and text inputs.
## ✨ Features
- Multilingual Support: Supports both Japanese and English.
- Multimodal Capability: Based on the NVILA-Lite architecture, it can handle both image and text inputs.
## 📦 Installation
```bash
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
```
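Optionally, you can verify that the pinned dependencies resolved before loading the model; this quick check is just a suggestion, and the version string is the one pinned above:

```python
# Optional sanity check that the installed packages import cleanly
import accelerate, cv2, einops, PIL, torch, torchvision, transformers

print(transformers.__version__)  # expected: 4.45.0
```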
## 💻 Usage Examples
### Basic Usage
```python
from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-2B"

# Load via the config, or directly with from_pretrained
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Inspect the chat template applied to prompts
print(model.tokenizer.chat_template)

# Text-only generation
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)
```
### Advanced Usage
```python
from PIL import Image
import requests

# Single image with a Japanese prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])  # "Please describe the image."
print(response)
print("---" * 40)
```
```python
from PIL import Image
import requests
from transformers import GenerationConfig

# Customize decoding with a GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=generation_config,
)
print(response)
print("---" * 40)
```
```python
from PIL import Image
import requests

# Interleave multiple images and text segments in one prompt
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan"
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して",  # "Explain the differences between the images"
])
print(response)
print("---" * 40)
```
## 📚 Documentation
### Model Overview

### Training Summary
| Stage | Training | Data Sources | Samples |
|-------|----------|--------------|---------|
| Stage1 | Projector | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
| Stage2 | Projector, LLM | Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
| | | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
| Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.1M |
### Evaluation
I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard (as of March 2025) and the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with LLM-as-a-judge using "gpt-4o-2024-05-13", while the official Sarashina2-Vision-14B blog used "gpt-4o-2024-08-06"; because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.
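For illustration, an LLM-as-a-judge call of the kind described above might look like the sketch below. This is not the actual llm-jp-eval-mm implementation; the prompt wording, the 1-5 scale, and the `judge` helper are assumptions.

```python
# Minimal LLM-as-a-judge sketch, NOT the llm-jp-eval-mm implementation.
# The prompt wording and the 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(question: str, reference: str, answer: str) -> str:
    prompt = (
        "You are grading a model's answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Rate the model answer from 1 (poor) to 5 (excellent). Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # judge model used for Heron-NVILA-Lite here
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content
```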
## 🔧 Technical Details
The model is based on the NVILA-Lite architecture, which combines a vision encoder, a projector, and a large language model (LLM). The vision encoder extracts features from images, the projector maps these features to the LLM's input space, and the LLM generates responses based on the input text and image features.
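As a rough sketch of that data flow, the snippet below wires the three components together. Module names, dimensions, and the token-concatenation scheme are illustrative assumptions, not the actual NVILA-Lite code.

```python
import torch
import torch.nn as nn

# Illustrative encoder -> projector -> LLM data flow.
# Dimensions and wiring are assumptions, not the actual NVILA-Lite code.
class VLMSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder             # image -> patch features
        self.projector = nn.Linear(vision_dim, llm_dim)  # map features into the LLM embedding space
        self.llm = llm                                   # autoregressive text decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.vision_encoder(pixel_values)  # (batch, n_patches, vision_dim)
        image_embeds = self.projector(patch_feats)       # (batch, n_patches, llm_dim)
        # Prepend the projected image tokens to the text embeddings,
        # then let the LLM generate conditioned on both.
        inputs = torch.cat([image_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```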
## 📄 License
## ⚠️ Important Note
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.
## 💡 Acknowledgements
This model is based on results obtained from the project JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO). It also makes use of the following open-source repositories: VILA and llm-jp-eval-mm.