🚀 Llama-3.1-Nemotron-Nano-VL-8B-V1
Llama-3.1-Nemotron-Nano-VL-8B-V1 is a leading document intelligence vision language model that enables querying and summarizing images and videos.
🚀 Quick Start
📦 Install Dependencies
pip install transformers accelerate timm einops open-clip-torch
💻 Usage Examples
🔍 Basic Usage
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"

# Load the model, tokenizer, and image processor (the custom model code requires trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True, device_map="cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(path)
image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")

# Preprocess the input images
image1 = Image.open("images/example1a.jpeg")
image2 = Image.open("images/example1b.jpeg")
image_features = image_processor([image1, image2])

# Greedy decoding, up to 1024 new tokens
generation_config = dict(max_new_tokens=1024, do_sample=False, eos_token_id=tokenizer.eos_token_id)

question = 'Describe the two images.'
response = model.chat(
    tokenizer=tokenizer, question=question, generation_config=generation_config,
    **image_features)
print(f'User: {question}\nAssistant: {response}')
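The same model.chat interface also works for a single image. The sketch below assumes the image processor accepts a one-element list, exactly as in the two-image example above; your_document.png is a placeholder path, not a file shipped with the model.

from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
model = AutoModel.from_pretrained(path, trust_remote_code=True, device_map="cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(path)
image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")

# "your_document.png" is a placeholder; substitute the path to your own image
image = Image.open("your_document.png")
image_features = image_processor([image])  # assumption: a one-element list behaves like the two-image case

generation_config = dict(max_new_tokens=512, do_sample=False, eos_token_id=tokenizer.eos_token_id)

question = 'Extract all the text in this image.'
response = model.chat(
    tokenizer=tokenizer, question=question, generation_config=generation_config,
    **image_features)
print(f'User: {question}\nAssistant: {response}')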
✨ Features
Llama Nemotron Nano VL is a leading document intelligence vision language model (VLM) that enables querying and summarizing images and videos from the physical or virtual world. It is deployable in the data center, in the cloud, and at the edge, including on Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. Key findings during development include: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance.
This model was trained on commercial images and videos for all three stages of training and supports single image and video inference.
📚 Documentation
Model Overview
Description
Llama Nemotron Nano VL is a cutting-edge document intelligence vision language model (VLM). It allows users to query and summarize images and videos from the real or virtual world. It can be deployed in data centers, in the cloud, and at the edge, such as on Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework.
This model was trained on commercial images and videos across all three training stages and supports single-image and video inference.
License/Terms of Use
Governing Terms:
Your use of the model is governed by the NVIDIA Open License Agreement.
Additional Information: Llama 3.1 Community Model License; Built with Llama.
Deployment Geography
Global
Use Case
Customers: AI foundry enterprise customers
Use Cases: Image summarization, text-image analysis, Optical Character Recognition, Interactive Q&A on images, Comparison and contrast of multiple images, Text Chain-of-Thought reasoning.
Release Date
Model Architecture
| Property | Details |
|----------|---------|
| Network Type | Transformer |
| Network Architecture | Vision Encoder: CRadioV2-H; Language Encoder: Llama-3.1-8B-Instruct |
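For a quick local check of these components, the model's configuration can be loaded and printed with Hugging Face AutoConfig. This is a minimal sketch; the exact field names inside the remote configuration are not guaranteed by this card.

from transformers import AutoConfig

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
# trust_remote_code=True is required because the model ships custom configuration code
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
# Print the full configuration to inspect the vision encoder and language backbone settings
print(config)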
Input
- Input Type(s): Image, Video, Text
- Input Images Supported: Multiple images within 16K input + output tokens
- Language Supported: English only
- Input Format(s): Image (Red, Green, Blue (RGB)), Video (.mp4), and Text (String)
- Input Parameters: Image (2D), Video (3D), Text (1D)
- Other Properties Related to Input:
- Input + Output Token: 16K
- Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels. Supported aspect ratios include 4 × 3 (up to 2048 × 1536 pixels), 3 × 4 (up to 1536 × 2048 pixels), 2 × 6 (up to 1024 × 3072 pixels), and 6 × 2 (up to 3072 × 1024 pixels). Other configurations are allowed as long as the total number of tiles is ≤ 12 (see the sketch after this list).
- Channel Count: 3 channels (RGB)
- Alpha Channel: Not supported (no transparency)
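To make the tile constraint concrete, here is a small standalone sketch (not part of the official preprocessing code) that checks whether a tile layout fits the 12-tile budget and reports the maximum resolution it covers, assuming 512 × 512 tiles as described above.

TILE_SIZE = 512   # pixels per tile edge, per the resolution constraint above
MAX_TILES = 12    # total tile budget

def fits_tile_budget(tiles_wide: int, tiles_high: int) -> bool:
    # A layout is valid as long as the total number of tiles does not exceed the budget
    return tiles_wide * tiles_high <= MAX_TILES

def max_resolution(tiles_wide: int, tiles_high: int) -> tuple:
    # Maximum pixel resolution covered by the given tile layout
    return (tiles_wide * TILE_SIZE, tiles_high * TILE_SIZE)

# Layouts from the list above, plus one that exceeds the budget (4 x 4 = 16 tiles)
for layout in [(4, 3), (3, 4), (2, 6), (6, 2), (4, 4)]:
    print(layout, fits_tile_budget(*layout), max_resolution(*layout))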
Output
- Output Type(s): Text
- Output Formats: String
- Output Parameters: 1D
- Other Properties Related to Output: Input + Output Token: 16K
The model is designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), it achieves faster training and inference times compared to CPU-only solutions.
Software Integration
- Runtime Engine(s): TensorRT-LLM
- Supported Hardware Microarchitecture Compatibility: H100 SXM 80GB
- Supported Operating System(s): Linux
Model Versions
Llama-3.1-Nemotron-Nano-VL-8B-V1
Training/Evaluation Dataset
NV-Pretraining and NV-CosmosNemotron-SFT were used for training and evaluation.
Data Collection Method by dataset (Training and Evaluation): Hybrid: Human, Synthetic
Labeling Method by dataset (Training and Evaluation): Hybrid: Human, Synthetic
The dataset collection (for training and evaluation) consists of a mix of internal and public datasets for various tasks, including:
- Internal datasets built with public commercial images and internal labels for tasks like conversation modeling and document analysis.
- Public datasets sourced from publicly available images and annotations for tasks such as image captioning and visual question answering.
- Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
- Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
Evaluation Benchmarks
| Benchmark | Score |
|-----------|-------|
| MMMU Val (with ChatGPT as a judge) | 48.2% |
| AI2D | 85.0% |
| ChartQA | 86.3% |
| InfoVQA Val | 77.4% |
| OCRBench | 839 |
| OCRBenchV2 English | 60.1% |
| OCRBenchV2 Chinese | 37.9% |
| DocVQA Val | 91.2% |
| VideoMME | 54.7% |
Inference
- Engine: TensorRT-LLM
- Test Hardware: 1x NVIDIA H100 SXM 80GB
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices for AI development. When developers download or use the model according to the terms of service, they should work with their internal model team to ensure the model meets industry requirements and addresses potential misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
Users are responsible for model inputs and outputs and for ensuring safe integration of the model, including implementing guardrails and other safety mechanisms before deployment.
Outputs generated by these models may contain political content, potentially misleading information, content security and safety issues, or unwanted bias.
📄 License
Your use of the model is governed by the NVIDIA Open License Agreement. Additional Information: Llama 3.1 Community Model License; Built with Llama.