Heron-NVILA-Lite-33B
Heron-NVILA-Lite-33B is a vision language model trained for Japanese, based on the NVILA-Lite architecture. It supports multimodal interactions in both Japanese and English.
Quick Start
Installation
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
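Optionally, you can run a quick check that the main pinned dependencies are importable and that the expected transformers version is installed (this snippet is just a convenience, not part of the required setup):

# Optional: confirm the pinned dependencies are importable
import transformers, torch, cv2, einops, PIL
print(transformers.__version__)  # should print 4.45.0
print(torch.__version__, cv2.__version__, einops.__version__, PIL.__version__)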
Usage Examples
Basic Usage
from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-33B"

# You can instantiate the model from its config ...
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")
# ... or load the pretrained weights directly (used below)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Inspect the chat template applied by the tokenizer
print(model.tokenizer.chat_template)

# Text-only generation
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)
Advanced Usage
from PIL import Image
import requests

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Single image + Japanese prompt
response = model.generate_content([image, "画像を説明してください。"])  # "Please describe the image."
print(response)
print("---" * 40)
from PIL import Image
import requests
from transformers import GenerationConfig

# Custom decoding parameters
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

response = model.generate_content(
    [image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=generation_config,
)
print(response)
print("---" * 40)
from PIL import Image
import requests

url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]

# Interleave multiple images with text
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan."
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria."
    "各画像の違いを説明して",  # "Explain the differences between the images."
])
print(response)
print("---" * 40)
Features
- Multilingual Support: Supports both Japanese and English, facilitating cross-language multimodal interactions (see the short sketch after this list).
- Multimodal Capability: Integrates vision and language, enabling tasks such as image-text generation and image description.
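For example, the sketch below sends the same question in Japanese and in English through the generate_content API used above. It assumes the model has already been loaded as in Quick Start; the prompts themselves are only illustrative.

# Cross-language sketch: same question asked in Japanese and English.
# Assumes `model` is already loaded as shown in Quick Start.
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

for prompt in ["画像を説明してください。", "Please describe this image."]:
    response = model.generate_content([image, prompt])
    print(prompt, "->", response)
    print("---" * 40)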
Documentation
Model Overview
Training Summary
| Stage | Training | Data Sources | Samples |
| --- | --- | --- | --- |
| Stage1 | Projector | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
| Stage2 | Projector, LLM | Filtered MOMIJI (CC-MAIN-2024-42) | 3M |
| | | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
| Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.1M |
Evaluation
I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard as of March 2025 and from the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with an LLM-as-a-judge setup using "gpt-4o-2024-05-13". The official Sarashina2-Vision-14B blog used "gpt-4o-2024-08-06" as the judge; because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.
Technical Details
This model is based on the NVILA-Lite architecture, which combines a vision encoder, a projector, and a large language model (LLM). The vision encoder extracts features from images, the projector maps those features into the LLM's embedding space, and the LLM generates text responses conditioned on the projected image features and the text prompt; this flow is sketched below.
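The sketch below is only a conceptual outline of that data flow. The class name, dimensions, and placeholder layers are illustrative and do not reflect the actual Heron-NVILA-Lite implementation.

# Conceptual sketch only -- names, sizes, and layers are illustrative placeholders,
# not the real Heron-NVILA-Lite components.
import torch
import torch.nn as nn

class VisionLanguagePipeline(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)  # stand-in for the vision encoder
        self.projector = nn.Linear(vision_dim, llm_dim)             # maps vision features into the LLM space
        self.llm_head = nn.Linear(llm_dim, vocab_size)              # stand-in for the language model

    def forward(self, image_tensor):
        vision_features = self.vision_encoder(image_tensor.flatten(1))  # image -> visual features
        projected = self.projector(vision_features)                     # features -> LLM embedding space
        return self.llm_head(projected)                                 # LLM output (real model also consumes text tokens)

pipeline = VisionLanguagePipeline()
dummy_image = torch.randn(1, 3, 224, 224)
print(pipeline(dummy_image).shape)  # torch.Size([1, 32000])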
License
Important Note
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.
Usage Tip
When using the model, make sure input images are in a format the vision encoder accepts (the examples above pass PIL images converted to RGB). Also adjust the generation configuration parameters to your specific needs to obtain better results; a minimal sketch follows below.
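For instance, a minimal sketch along those lines. The local file path and the decoding values are placeholders, and the model is assumed to be loaded as in Quick Start.

# Illustrative only: the file path and decoding values are placeholders.
from PIL import Image
from transformers import GenerationConfig

# Ensure the image is an RGB PIL image, as in the examples above
image = Image.open("my_photo.jpg").convert("RGB")

# Tune decoding to your needs (these values are just a starting point)
generation_config = GenerationConfig(
    max_new_tokens=256,
    temperature=0.2,
    do_sample=True,
)

response = model.generate_content(
    [image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=generation_config,
)
print(response)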
Acknowledgements
This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
I would like to acknowledge the use of the following open-source repositories: