🚀 Llama-3.1-Nemotron-Nano-VL-8B-V1
Llama-3.1-Nemotron-Nano-VL-8B-V1 is a leading document intelligence vision language model that enables querying and summarizing images and videos.
🚀 Quick Start
📦 Install Dependencies
pip install transformers accelerate timm einops open-clip-torch
💻 Usage Examples
🔍 Basic Usage
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"

# Load the model, tokenizer, and image processor (the custom model code requires trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True, device_map="cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(path)
image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")

# Preprocess the input images
image1 = Image.open("images/example1a.jpeg")
image2 = Image.open("images/example1b.jpeg")
image_features = image_processor([image1, image2])

# Greedy decoding, up to 1024 new tokens
generation_config = dict(max_new_tokens=1024, do_sample=False, eos_token_id=tokenizer.eos_token_id)

question = 'Describe the two images.'
response = model.chat(
    tokenizer=tokenizer, question=question, generation_config=generation_config,
    **image_features)
print(f'User: {question}\nAssistant: {response}')
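The same model.chat interface also works for a single image. The sketch below assumes the image processor accepts a one-element list, exactly as in the two-image example above; your_document.png is a placeholder path, not a file shipped with the model.

from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
model = AutoModel.from_pretrained(path, trust_remote_code=True, device_map="cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(path)
image_processor = AutoImageProcessor.from_pretrained(path, trust_remote_code=True, device="cuda")

# "your_document.png" is a placeholder; substitute the path to your own image
image = Image.open("your_document.png")
image_features = image_processor([image])  # assumption: a one-element list behaves like the two-image case

generation_config = dict(max_new_tokens=512, do_sample=False, eos_token_id=tokenizer.eos_token_id)

question = 'Extract all the text in this image.'
response = model.chat(
    tokenizer=tokenizer, question=question, generation_config=generation_config,
    **image_features)
print(f'User: {question}\nAssistant: {response}')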
✨ Features
Llama Nemotron Nano VL is a leading document intelligence vision language model (VLM) that enables querying and summarizing images and videos from the physical or virtual world. It is deployable in the data center, in the cloud, and at the edge, including on Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. Key findings during development include: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance.
This model was trained on commercial images and videos for all three stages of training and supports single image and video inference.
📚 Documentation
Model Overview
Description
Llama Nemotron Nano VL is a cutting-edge document intelligence vision language model (VLM). It allows users to query and summarize images and videos from the real or virtual world. It can be deployed in data centers, in the cloud, and at the edge, such as on Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework.
This model was trained on commercial images and videos across all three training stages and supports single-image and video inference.
License/Terms of Use
Governing Terms:
Your use of the model is governed by the NVIDIA Open License Agreement.
Additional Information: Llama 3.1 Community Model License; Built with Llama.
Deployment Geography
Global
Use Case
Customers: AI foundry enterprise customers
Use Cases: Image summarization, text-image analysis, Optical Character Recognition, Interactive Q&A on images, Comparison and contrast of multiple images, Text Chain-of-Thought reasoning.
Release Date
Model Architecture
| Property | Details |
|----------|---------|
| Network Type | Transformer |
| Network Architecture | Vision Encoder: CRadioV2-H; Language Encoder: Llama-3.1-8B-Instruct |
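For a quick local check of these components, the model's configuration can be loaded and printed with Hugging Face AutoConfig. This is a minimal sketch; the exact field names inside the remote configuration are not guaranteed by this card.

from transformers import AutoConfig

path = "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1"
# trust_remote_code=True is required because the model ships custom configuration code
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
# Print the full configuration to inspect the vision encoder and language backbone settings
print(config)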
Input
- Input Type(s): Image, Video, Text
- Input Images Supported: Multiple images within 16K input + output tokens
- Language Supported: English only
- Input Format(s): Image (Red, Green, Blue (RGB)), Video (.mp4), and Text (String)
- Input Parameters: Image (2D), Video (3D), Text (1D)
- Other Properties Related to Input:
- Input + Output Token: 16K
- Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels. Supported aspect ratios include 4 × 3 (up to 2048 × 1536 pixels), 3 × 4 (up to 1536 × 2048 pixels), 2 × 6 (up to 1024 × 3072 pixels), and 6 × 2 (up to 3072 × 1024 pixels). Other configurations are allowed as long as the total number of tiles is ≤ 12 (see the sketch after this list).
- Channel Count: 3 channels (RGB)
- Alpha Channel: Not supported (no transparency)
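To make the tile constraint concrete, here is a small standalone sketch (not part of the official preprocessing code) that checks whether a tile layout fits the 12-tile budget and reports the maximum resolution it covers, assuming 512 × 512 tiles as described above.

TILE_SIZE = 512   # pixels per tile edge, per the resolution constraint above
MAX_TILES = 12    # total tile budget

def fits_tile_budget(tiles_wide: int, tiles_high: int) -> bool:
    # A layout is valid as long as the total number of tiles does not exceed the budget
    return tiles_wide * tiles_high <= MAX_TILES

def max_resolution(tiles_wide: int, tiles_high: int) -> tuple:
    # Maximum pixel resolution covered by the given tile layout
    return (tiles_wide * TILE_SIZE, tiles_high * TILE_SIZE)

# Layouts from the list above, plus one that exceeds the budget (4 x 4 = 16 tiles)
for layout in [(4, 3), (3, 4), (2, 6), (6, 2), (4, 4)]:
    print(layout, fits_tile_budget(*layout), max_resolution(*layout))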
Output
- Output Type(s): Text
- Output Formats: String
- Output Parameters: 1D
- Other Properties Related to Output: Input + Output Token: 16K
The model is designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), it achieves faster training and inference times compared to CPU-only solutions.
Software Integration
- Runtime Engine(s): TensorRT-LLM
- Supported Hardware Microarchitecture Compatibility: H100 SXM 80GB
- Supported Operating System(s): Linux
Model Versions
Llama-3.1-Nemotron-Nano-VL-8B-V1
Training/Evaluation Dataset
NV-Pretraining and NV-CosmosNemotron-SFT were used for training and evaluation.
Data Collection Method by dataset (Training and Evaluation): Hybrid: Human, Synthetic
Labeling Method by dataset (Training and Evaluation): Hybrid: Human, Synthetic
The dataset collection (for training and evaluation) consists of a mix of internal and public datasets for various tasks, including:
- Internal datasets built with public commercial images and internal labels for tasks like conversation modeling and document analysis.
- Public datasets sourced from publicly available images and annotations for tasks such as image captioning and visual question answering.
- Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
- Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
Evaluation Benchmarks
| Benchmark | Score |
|-----------|-------|
| MMMU Val (with ChatGPT as a judge) | 48.2% |
| AI2D | 85.0% |
| ChartQA | 86.3% |
| InfoVQA Val | 77.4% |
| OCRBench | 839 |
| OCRBenchV2 English | 60.1% |
| OCRBenchV2 Chinese | 37.9% |
| DocVQA Val | 91.2% |
| VideoMME | 54.7% |
Inference
- Engine: TensorRT-LLM
- Test Hardware: 1x NVIDIA H100 SXM 80GB
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices for AI development. When developers download or use the model according to the terms of service, they should work with their internal model team to ensure the model meets industry requirements and addresses potential misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
Users are responsible for model inputs and outputs and for ensuring safe integration of the model, including implementing guardrails and other safety mechanisms before deployment.
Outputs generated by these models may contain political content, potentially misleading information, content security and safety issues, or unwanted bias.
📄 License
Your use of the model is governed by the NVIDIA Open License Agreement. Additional Information: Llama 3.1 Community Model License; Built with Llama.