# 🚀 NVLM 1.0

A family of frontier-class multimodal large language models achieving state-of-the-art results on vision-language tasks.
## 🚀 Quick Start
This README provides an overview of the NVLM 1.0 model, including its features, benchmark results, architecture, and usage instructions.
## ✨ Features

- Performs vision-language and text-only tasks such as optical character recognition, multimodal reasoning, localization, common-sense reasoning, world knowledge utilization, and coding.
- Shows improved text-only performance over its LLM backbone after multimodal training.
- Ready for non-commercial use.
## 📦 Installation

### Prepare the environment

We provide a Docker build file in the Dockerfile for reproduction. The Docker image is based on `nvcr.io/nvidia/pytorch:23.09-py3`.
### ⚠️ Important Note

We observe that different versions of Transformers, CUDA, and Docker can lead to slight differences in benchmark numbers. We recommend using the Dockerfile above for precise reproduction.
## 💻 Usage Examples

### Model loading
```python
import torch
from transformers import AutoModel

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True).eval()
```
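
A quick check that the weights landed in bfloat16 and that the module is in eval mode (a minimal sanity-check sketch, not part of the original recipe):

```python
# Confirm the checkpoint loaded in bfloat16 and inference mode.
print(next(model.parameters()).dtype)  # torch.bfloat16
print(model.training)                  # False
```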
### Multiple GPUs

The model can be loaded on multiple GPUs as follows:
```python
import math
import torch
from transformers import AutoModel

def split_model():
    # Spread the 80 decoder layers across the available GPUs; GPU 0 also
    # hosts the vision encoder, so it receives a reduced share of layers.
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    # Pin the last decoder layer to GPU 0, next to the output head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
```
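
For intuition, with 8 GPUs the arithmetic above assigns `ceil(80 / 7.5) = 11` layers to each full GPU and `ceil(11 * 0.5) = 6` layers to GPU 0, which also hosts the vision encoder. A standalone sketch of the same layer-split arithmetic (the GPU count here is hypothetical):

```python
import math

world_size = 8   # hypothetical GPU count
num_layers = 80
per_gpu = math.ceil(num_layers / (world_size - 0.5))  # 11
per_gpu = [per_gpu] * world_size
per_gpu[0] = math.ceil(per_gpu[0] * 0.5)              # GPU 0 gets 6 layer slots plus the ViT
print(per_gpu, sum(per_gpu))  # [6, 11, 11, 11, 11, 11, 11, 11] 83
```

The slot total (83) can exceed the real layer count (80); the surplus entries refer to layer indices that do not exist and are simply never matched.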
### Inference
```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

# Reuse split_model() from the "Multiple GPUs" section above.

# ImageNet normalization statistics used by the InternViT vision encoder.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the tile grid whose aspect ratio is closest to the input image's;
    # ties are broken in favor of grids that cover more of the original area.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # Enumerate candidate tile grids (i columns x j rows) with min_num..max_num tiles.
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if i * j == n
    )
    best_ratio = find_closest_aspect_ratio(aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # Resize to the chosen grid, then crop into image_size x image_size tiles.
    target_width = image_size * best_ratio[0]
    target_height = image_size * best_ratio[1]
    resized_img = image.resize((target_width, target_height), Image.Resampling.LANCZOS)
    tiles = [
        resized_img.crop((
            (i % best_ratio[0]) * image_size,
            (i // best_ratio[0]) * image_size,
            ((i % best_ratio[0]) + 1) * image_size,
            ((i // best_ratio[0]) + 1) * image_size,
        ))
        for i in range(best_ratio[0] * best_ratio[1])
    ]

    # Optionally append a downscaled thumbnail of the whole image as a global view.
    if use_thumbnail and len(tiles) != 1:
        tiles.append(image.resize((image_size, image_size), Image.Resampling.LANCZOS))

    transform = build_transform(image_size)
    return torch.stack([transform(tile) for tile in tiles])
```
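
Putting it together, a minimal single-image, single-turn example might look like the following. It is a sketch that assumes the InternVL-style chat interface (`model.chat`) exposed by the adapted HF codebase; `image.jpg` is a placeholder path:

```python
path = "nvidia/NVLM-D-72B"
device_map = split_model()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()

# Tile the image and move the pixel values to GPU 0, which hosts the vision encoder.
pixel_values = dynamic_preprocess(Image.open('image.jpg'), max_num=12, use_thumbnail=True)
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# generation_config keys follow the usual HF generate() arguments (assumption).
generation_config = dict(max_new_tokens=1024, do_sample=False)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
```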
## 📚 Documentation

### Model Overview

#### Description

This family of models performs vision-language and text-only tasks including optical character recognition, multimodal reasoning, localization, common-sense reasoning, world knowledge utilization, and coding. This model is ready for non-commercial use.
#### License/Terms of Use

Governing Terms: [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/deed.en). Additional Information: [LICENSE · Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE) for Qwen2-72B-Instruct and [The MIT License](https://opensource.org/licenses/MIT) for InternViT-6B-448px-V1-2.
#### Model Details

On September 17th, 2024, we introduced NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.

In this repo, we open-source NVLM-1.0-D-72B (decoder-only architecture): the model weights and inference code for the community.
#### Reference(s)

Paper · [Inference Code (HF)](https://huggingface.co/nvidia/NVLM-D-72B/tree/main) · [Training Code](https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0/examples/multimodal/nvlm) · [Website](https://research.nvidia.com/labs/adlr/NVLM-1/)
#### Benchmark Results

We train our model with legacy [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/legacy) and adapt the codebase to Huggingface for model hosting, reproducibility, and inference. We observe numerical differences between the Megatron and Huggingface codebases, which are within the expected range of variation. We provide results from both codebases for reproducibility and comparison with other models.

Results (as of September 17th, 2024) on the multimodal benchmarks are as follows:

##### Vision-language Benchmarks
| Model | MMMU (val / test) | MathVista | OCRBench | AI2D | ChartQA | DocVQA | TextVQA | RealWorldQA | VQAv2 |
|---|---|---|---|---|---|---|---|---|---|
| NVLM-D 1.0 72B (Huggingface) | 58.7 / 54.9 | 65.2 | 852 | 94.2 | 86.0 | 92.6 | 82.6 | 69.5 | 85.4 |
| NVLM-D 1.0 72B (Megatron) | 59.7 / 54.6 | 65.2 | 853 | 94.2 | 86.0 | 92.6 | 82.1 | 69.7 | 85.4 |
| Llama 3.2 90B | 60.3 / - | 57.3 | - | 92.3 | 85.5 | 90.1 | - | - | 78.1 |
| Llama 3-V 70B | 60.6 / - | - | - | 93.0 | 83.2 | 92.2 | 83.4 | - | 79.1 |
| Llama 3-V 405B | 64.5 / - | - | - | 94.1 | 85.8 | 92.6 | 84.8 | - | 80.2 |
| InternVL2-Llama3-76B | 55.2 / - | 65.5 | 839 | 94.8 | 88.4 | 94.1 | 84.4 | 72.2 | - |
| GPT-4V | 56.8 / 55.7 | 49.9 | 645 | 78.2 | 78.5 | 88.4 | 78.0 | 61.4 | 77.2 |
| GPT-4o | 69.1 / - | 63.8 | 736 | 94.2 | 85.7 | 92.8 | - | - | - |
| Claude 3.5 Sonnet | 68.3 / - | 67.7 | 788 | 94.7 | 90.8 | 95.2 | - | - | - |
| Gemini 1.5 Pro (Aug 2024) | 62.2 / - | 63.9 | 754 | 94.4 | 87.2 | 93.1 | 78.7 | 70.4 | 80.2 |
##### Text-only Benchmarks

| Model | Backbone LLM | MMLU | GSM8K | MATH | HumanEval | Avg. Accuracy |
|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | |
| GPT-4o | N/A | 88.7 | - | 76.6 | 90.2 | - |
| Gemini Pro 1.5 (Aug 2024) | N/A | 85.9 | 90.8 | 67.7 | 84.1 | 82.1 |
| Claude 3.5 Sonnet | N/A | 88.7 | 96.4 | 71.1 | 92.0 | 87.0 |
| **Open LLM** | | | | | | |
| (a) Nous-Hermes-2-Yi-34B | N/A | 75.5 | 78.6 | 21.8 | 43.3 | 54.8 |
| (b) Qwen2-72B-Instruct | N/A | 82.3 | 91.1 | 59.7 | 86.0 | 79.8 |
| (c) Llama-3-70B-Instruct | N/A | 82.0 | 93.0 | 51.0 | 81.7 | 76.6 |
| (d) Llama-3.1-70B-Instruct | N/A | 83.6 | 95.1 | 68.0 | 80.5 | 81.8 |
| (e) Llama-3.1-405B-Instruct | N/A | 87.3 | 96.8 | 73.8 | 89.0 | 86.7 |
| **Open Multimodal LLM** | | | | | | |
| VILA-1.5 40B | (a) | 73.3 | 67.5 | 16.8 | 34.1 | 🥶 47.9 (-6.9) |
| LLaVA-OneVision 72B | (b) | 80.6 | 89.9 | 49.2 | 74.4 | 🥶 73.5 (-6.3) |
| InternVL-2-Llama3-76B | (c) | 78.5 | 87.1 | 42.5 | 71.3 | 🥶 69.9 (-6.7) |
| *Llama 3-V 70B | (d) | 83.6 | 95.1 | 68.0 | 80.5 | 🙂 81.8 (0) |
| *Llama 3-V 405B | (e) | 87.3 | 96.8 | 73.8 | 89.0 | 🙂 86.7 (0) |
| NVLM-D 1.0 72B (Megatron) | (b) | 82.0 | 92.9 | 73.1 | 88.4 | 🥳 84.1 (+4.3) |
| NVLM-D 1.0 72B (Huggingface) | (b) | 81.7 | 93.2 | 73.1 | 89.0 | 🥳 84.3 (+4.5) |
#### Model Architecture

| Property | Details |
|---|---|
| Network Architecture | Decoder-Only Transformer |
| Text-only LLM backbone | [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) |
| Vision encoder | [InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) |
#### Robustness

The model trained on this dataset cannot regenerate its training data:

- The model has no image generation capability, since its output is text only. Hence, it cannot regenerate any image it saw during training.
- The model cannot regenerate training text data: during training, the model takes text and images as inputs, and its text output is conditioned on both. During inference, without the training images as input, the model cannot reproduce any part of the training text data.
#### Input

| Property | Details |
|---|---|
| Input Type(s) | Text, Image |
| Input Format(s) | String, [Pillow Library-Supported Formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) |
| Input Dimensions | One-Dimensional (1D), Two-Dimensional (2D) |
| Other Properties Related to Input | Maximum Token Length = 128K Tokens |
#### Output

| Property | Details |
|---|---|
| Output Type(s) | Text |
| Output Format | String |
| Model Output | 1D |
| Other Properties Related to Output | None |
#### How to use

When converting the Megatron checkpoint to Huggingface, we adapt the [InternVL codebase](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) to support model loading and multi-GPU inference in HF. We also use the tokenizer from [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/tree/main) when adapting the tokenizer to Huggingface, as it contains extra special tokens for vision tasks, e.g., `<|vision_pad|>`. We train NVLM-1.0-D-72B based on the [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct/tree/main) text-only model and the [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) ViT model with our large-scale, high-quality multimodal dataset. For training code, please refer to [Megatron-Core](https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0/examples/multimodal/nvlm).
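
As a quick sanity check that the adapted tokenizer carries the extra vision special tokens, something like the following should work (a sketch; the exact token inventory is defined by the hosted checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVLM-D-72B", trust_remote_code=True, use_fast=False)

# <|vision_pad|> should resolve to a single special-token id rather than the unknown token.
vision_pad_id = tokenizer.convert_tokens_to_ids("<|vision_pad|>")
print(vision_pad_id, tokenizer.decode([vision_pad_id]))
```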
## 🔧 Technical Details

The model is trained with legacy [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/legacy) and adapted to Huggingface for hosting, reproducibility, and inference. There are numerical differences between the Megatron and Huggingface codebases, which are within the expected range of variation.
## 📄 License

Governing Terms: [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/deed.en). Additional Information: [LICENSE · Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE) for Qwen2-72B-Instruct and [The MIT License](https://opensource.org/licenses/MIT) for InternViT-6B-448px-V1-2.