Heron-NVILA-Lite-33B
Heron-NVILA-Lite-33B is a vision language model trained for Japanese, based on the NVILA-Lite architecture. It supports multimodal interactions in both Japanese and English.
Quick Start
Installation
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
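Optionally, you can run a quick check that the main pinned dependencies are importable and that the expected transformers version is installed (this snippet is just a convenience, not part of the required setup):

# Optional: confirm the pinned dependencies are importable
import transformers, torch, cv2, einops, PIL
print(transformers.__version__)  # should print 4.45.0
print(torch.__version__, cv2.__version__, einops.__version__, PIL.__version__)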
Usage Examples
Basic Usage
from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-33B"

# You can instantiate the model from its config ...
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")
# ... or load the pretrained weights directly (used below)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Inspect the chat template applied by the tokenizer
print(model.tokenizer.chat_template)

# Text-only generation
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)
Advanced Usage
from PIL import Image
import requests

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Single image + Japanese prompt
response = model.generate_content([image, "画像を説明してください。"])  # "Please describe the image."
print(response)
print("---" * 40)
from PIL import Image
import requests
from transformers import GenerationConfig

# Custom decoding parameters
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

response = model.generate_content(
    [image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=generation_config,
)
print(response)
print("---" * 40)
from PIL import Image
import requests

url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]

# Interleave multiple images with text
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan."
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria."
    "各画像の違いを説明して",  # "Explain the differences between the images."
])
print(response)
print("---" * 40)
Features
- Multilingual Support: Supports both Japanese and English, facilitating cross-language multimodal interactions (see the short sketch after this list).
- Multimodal Capability: Integrates vision and language, enabling tasks such as image-text generation and image description.
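For example, the sketch below sends the same question in Japanese and in English through the generate_content API used above. It assumes the model has already been loaded as in Quick Start; the prompts themselves are only illustrative.

# Cross-language sketch: same question asked in Japanese and English.
# Assumes `model` is already loaded as shown in Quick Start.
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

for prompt in ["画像を説明してください。", "Please describe this image."]:
    response = model.generate_content([image, prompt])
    print(prompt, "->", response)
    print("---" * 40)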
Documentation
Model Overview
Training Summary
| Stage | Training | Data Sources | Samples |
| --- | --- | --- | --- |
| Stage1 | Projector | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
| Stage2 | Projector, LLM | Filtered MOMIJI (CC-MAIN-2024-42) | 3M |
| | | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
| Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.1M |
Evaluation
I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard as of March 2025 and from the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with an LLM-as-a-judge setup using "gpt-4o-2024-05-13". The official Sarashina2-Vision-14B blog used "gpt-4o-2024-08-06" as the judge; because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.
Technical Details
This model is based on the NVILA-Lite architecture, which combines a vision encoder, a projector, and a large language model (LLM). The vision encoder extracts features from images, the projector maps those features into the LLM's embedding space, and the LLM generates text responses conditioned on the projected image features and the text prompt; this flow is sketched below.
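The sketch below is only a conceptual outline of that data flow. The class name, dimensions, and placeholder layers are illustrative and do not reflect the actual Heron-NVILA-Lite implementation.

# Conceptual sketch only -- names, sizes, and layers are illustrative placeholders,
# not the real Heron-NVILA-Lite components.
import torch
import torch.nn as nn

class VisionLanguagePipeline(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)  # stand-in for the vision encoder
        self.projector = nn.Linear(vision_dim, llm_dim)             # maps vision features into the LLM space
        self.llm_head = nn.Linear(llm_dim, vocab_size)              # stand-in for the language model

    def forward(self, image_tensor):
        vision_features = self.vision_encoder(image_tensor.flatten(1))  # image -> visual features
        projected = self.projector(vision_features)                     # features -> LLM embedding space
        return self.llm_head(projected)                                 # LLM output (real model also consumes text tokens)

pipeline = VisionLanguagePipeline()
dummy_image = torch.randn(1, 3, 224, 224)
print(pipeline(dummy_image).shape)  # torch.Size([1, 32000])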
License
Important Note
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.
Usage Tip
When using the model, make sure input images are in a format the vision encoder accepts (the examples above pass PIL images converted to RGB). Also adjust the generation configuration parameters to your specific needs to obtain better results; a minimal sketch follows below.
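For instance, a minimal sketch along those lines. The local file path and the decoding values are placeholders, and the model is assumed to be loaded as in Quick Start.

# Illustrative only: the file path and decoding values are placeholders.
from PIL import Image
from transformers import GenerationConfig

# Ensure the image is an RGB PIL image, as in the examples above
image = Image.open("my_photo.jpg").convert("RGB")

# Tune decoding to your needs (these values are just a starting point)
generation_config = GenerationConfig(
    max_new_tokens=256,
    temperature=0.2,
    do_sample=True,
)

response = model.generate_content(
    [image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=generation_config,
)
print(response)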
Acknowledgements
This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
I would like to acknowledge the use of the following open-source repositories: