Heron NVILA Lite 15B

Developed by turing-motors

Heron-NVILA-Lite-15B is a vision-language model based on the NVILA-Lite architecture, specifically trained for Japanese, supporting both Japanese and English with image-text understanding and generation capabilities.

Image-to-Text

Safetensors

Supports Multiple Languages

Open Source License: Apache-2.0

Tags: Japanese Multimodal Dialogue, Visual-Language Joint Reasoning, High-Precision Image-Text Understanding

Release Time : 3/23/2025

Model Overview

This model is a multimodal vision-language model that takes image and text inputs and generates text outputs. It is primarily used for Japanese and English image-text dialogue, image captioning, and similar tasks.

Model Features

Multimodal Capability

Can process both image and text inputs simultaneously for image-text interaction

Japanese Optimization

Specifically trained and optimized for Japanese

Efficient Architecture

Utilizes the NVILA-Lite architecture to balance performance and efficiency

Multi-Stage Training

Undergoes a three-stage training process to enhance model performance

Model Capabilities

Image Understanding

Text Generation

Image-Text Dialogue

Multilingual Support

Interleaved Multi-Image Understanding

Use Cases

Image Understanding

Image Captioning

Generates descriptive text based on input images

Can accurately describe image content

Visual Question Answering

Image QA

Answers questions about image content

Scored 3.82/5.0 on JA-VG-VQA-500 (LLM-as-a-judge evaluation)

Multimodal Dialogue

Interleaved Image-Text Dialogue

Handles complex dialogues involving multiple images and texts

Can understand context and generate coherent responses

license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct/blob/main/LICENSE
language:
- ja
- en
tags:
- vila
- nvila
- conversational
- multimodal
base_model:
- Qwen/Qwen2.5-14B-Instruct
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
pipeline_tag: image-text-to-text

Heron-NVILA-Lite-15B

Heron-NVILA-Lite-15B is a vision language model trained for Japanese, based on the NVILA-Lite architecture.

Model Overview

Developer: Turing Inc.
Vision Encoder: paligemma-siglip-so400m-patch14-448
Projector: mlp_downsample_3x3_fix
LLM: Qwen2.5-14B-Instruct
Supported Languages: Japanese, English
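
As a rough illustration of what a 3x3 downsampling projector implies for sequence length, the sketch below counts visual tokens for a single 448x448 input. It assumes that `mlp_downsample_3x3_fix` merges each 3x3 neighbourhood of patch features into one token and pads the patch grid up to a multiple of 3, as the name suggests. The function `visual_tokens_per_tile` is hypothetical and not part of the model's code, and the real model may process more than one tile per image.

```python
import math


def visual_tokens_per_tile(image_size: int = 448, patch_size: int = 14, factor: int = 3) -> int:
    """Estimate visual tokens for one square tile (illustrative assumption only).

    image_size and patch_size come from the encoder name
    (paligemma-siglip-so400m-patch14-448); factor=3 is the assumed
    spatial downsampling of the projector.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size      # patches per side before the projector
    merged = math.ceil(grid / factor)    # grid padded up to a multiple of `factor`, then merged
    return merged * merged


if __name__ == "__main__":
    print(visual_tokens_per_tile())           # tokens after the assumed 3x3 merge
    print(visual_tokens_per_tile(factor=1))   # tokens without any downsampling
```

Under these assumptions a 32x32 patch grid shrinks to 11x11, which is why a downsampling projector reduces the number of tokens the LLM has to attend to.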

Setup

# I have confirmed that 4.46.0 and 4.49.0 also work. Other versions of Transformers may also work, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
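
Since only a few Transformers versions are reported as tested, it can help to check the installed version before loading the model. The following is a minimal sketch using only the standard library; the helper names `is_tested_version` and `describe_install` are hypothetical, and the version set simply repeats the versions mentioned in the setup comment.

```python
from importlib import metadata
from typing import Optional

# Versions reported as working in the setup notes above.
TESTED_VERSIONS = {"4.45.0", "4.46.0", "4.49.0"}


def is_tested_version(version: str) -> bool:
    """Return True if `version` is one of the versions reported as tested."""
    return version in TESTED_VERSIONS


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


def describe_install(version: Optional[str]) -> str:
    """Build a short human-readable status message for a version string."""
    if version is None:
        return "transformers is not installed"
    if is_tested_version(version):
        return f"transformers {version} is a tested version"
    return f"transformers {version} is untested with this model; it may still work"


if __name__ == "__main__":
    print(describe_install(installed_transformers_version()))
```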

Usage

from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-15B"

# Option 1: build the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# Option 2: load the model directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Show the chat template
print(model.tokenizer.chat_template)

# Example: generate from raw text
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)

# Example: generate from an image and text
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])  # "Please describe the image."
print(response)
print("---" * 40)

# Example: generate with a custom generation_config
from transformers import GenerationConfig
generation_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.5,
    do_sample=True,
)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=generation_config
)
print(response)
print("---" * 40)

# Example: generate from interleaved images and text (image + text + image + text + text)
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan"
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して"])  # "Explain the differences between the images"
print(response)
print("---" * 40)
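
The interleaved example above passes a flat list in which each image is followed by its caption and a final instruction closes the prompt. A small helper can build that list for any number of images. This is only a sketch: `build_interleaved_content` is a hypothetical name, not part of the model's API, and it assumes `generate_content` accepts the same flat list layout shown above.

```python
from typing import Any, List, Optional, Sequence


def build_interleaved_content(
    images: Sequence[Any],
    captions: Sequence[str],
    question: Optional[str] = None,
) -> List[Any]:
    """Return [image_0, caption_0, image_1, caption_1, ..., question]."""
    if len(images) != len(captions):
        raise ValueError("images and captions must have the same length")
    content: List[Any] = []
    for image, caption in zip(images, captions):
        content.append(image)    # the image comes first
        content.append(caption)  # followed by the text that refers to it
    if question:
        content.append(question)  # final instruction covering all images
    return content


# Usage with the model loaded above (not executed here):
# response = model.generate_content(
#     build_interleaved_content(images, ["caption 1", "caption 2"], "question")
# )
```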

Training Summary

Stage	Trained Components	Data Sources	Samples
Stage1	Projector	Japanese image text pairs, LLaVA-Pretrain	1.1M
Stage2	Projector, LLM	Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05)	13M
		Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions	20M
Stage3	Vision Encoder, Projector, LLM	llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock	1.1M
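
The table above implies a staged unfreezing schedule: each stage trains more components than the one before. The sketch below records that schedule as plain data and derives which components stay frozen in each stage. The component labels (`vision_encoder`, `projector`, `llm`) and the helper `frozen_components` are hypothetical names chosen for illustration and do not come from the training code.

```python
# All trainable parts of the model, using illustrative labels.
ALL_COMPONENTS = frozenset({"vision_encoder", "projector", "llm"})

# Components trained in each stage, as listed in the training summary table.
TRAINED_PER_STAGE = {
    "Stage1": frozenset({"projector"}),
    "Stage2": frozenset({"projector", "llm"}),
    "Stage3": frozenset({"vision_encoder", "projector", "llm"}),
}


def frozen_components(stage: str) -> frozenset:
    """Return the components that are NOT trained in the given stage."""
    return ALL_COMPONENTS - TRAINED_PER_STAGE[stage]


if __name__ == "__main__":
    for stage in TRAINED_PER_STAGE:
        print(stage, "frozen:", sorted(frozen_components(stage)))
```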

Evaluation

I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard as of March 2025 and from the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated using LLM-as-a-judge with "gpt-4o-2024-05-13". Note that the official Sarashina2-Vision-14B blog evaluated that model with "gpt-4o-2024-08-06"; because the evaluation conditions differ, the Sarashina2-Vision-14B results here should be treated as reference only.

Model	LLM Size	Heron-Bench overall LLM (%)	JA-VLM-Bench-In-the-Wild LLM (/5.0)	JA-VG-VQA-500 LLM (/5.0)
Heron-NVILA-Lite-1B	0.5B	45.9	2.92	3.16
Heron-NVILA-Lite-2B	1.5B	52.8	3.52	3.50
Heron-NVILA-Lite-15B	14B	59.6	4.2	3.82
LLaVA-CALM2-SigLIP	7B	43.3	3.15	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	2.92	2.96
VILA-jp	13B	57.2	3.69	3.62
Asagi-14B	13B	55.8	3.44	3.84
Sarashina2-Vision-14B	13B	50.9	4.1	3.43
Qwen2-VL 7B Instruct	7B	55.5	3.61	3.6
GPT-4o	-	87.6	3.85	3.58
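
To compare models on a single benchmark, the table above can be loaded into a small structure and sorted. The following sketch copies the scores verbatim from the table; the names `SCORES`, `BENCHMARKS`, and `rank` are hypothetical helpers for illustration, and the caveat about differing judge versions still applies to any ranking derived this way.

```python
BENCHMARKS = ("Heron-Bench", "JA-VLM-Bench-In-the-Wild", "JA-VG-VQA-500")

# model -> (Heron-Bench overall %, JA-VLM-Bench-In-the-Wild /5.0, JA-VG-VQA-500 /5.0)
SCORES = {
    "Heron-NVILA-Lite-1B": (45.9, 2.92, 3.16),
    "Heron-NVILA-Lite-2B": (52.8, 3.52, 3.50),
    "Heron-NVILA-Lite-15B": (59.6, 4.2, 3.82),
    "LLaVA-CALM2-SigLIP": (43.3, 3.15, 3.21),
    "Llama-3-EvoVLM-JP-v2": (39.3, 2.92, 2.96),
    "VILA-jp": (57.2, 3.69, 3.62),
    "Asagi-14B": (55.8, 3.44, 3.84),
    "Sarashina2-Vision-14B": (50.9, 4.1, 3.43),
    "Qwen2-VL 7B Instruct": (55.5, 3.61, 3.6),
    "GPT-4o": (87.6, 3.85, 3.58),
}


def rank(benchmark: str, exclude: tuple = ()) -> list:
    """Return (model, score) pairs for one benchmark, best score first."""
    idx = BENCHMARKS.index(benchmark)
    rows = [(name, s[idx]) for name, s in SCORES.items() if name not in exclude]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for bench in BENCHMARKS:
        top_model, top_score = rank(bench, exclude=("GPT-4o",))[0]
        print(f"{bench}: best open model is {top_model} ({top_score})")
```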

Risks and Limitations

This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.

License

Model weights are licensed under Apache License 2.0.
Users must comply with OpenAI terms of use due to the inclusion of GPT-4-generated synthetic data.

Acknowledgements

This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

I would like to acknowledge the use of the following open-source repositories:
