NVLM-D-72B: Open-Source Multimodal Large Language Model with Strong Performance on Vision-Language Tasks

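NVLM-D-72B

Developed by NVIDIA

NVLM 1.0 is a family of frontier-class multimodal large language models that achieves state-of-the-art results on vision-language tasks, rivaling leading proprietary and open-access models.

Image-to-Text

Transformers

English #Multimodal reasoning #Optical character recognition #Vision-language tasks

Downloads: 14.33k

Release date: September 30, 2024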

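Model Overview

The model handles both vision-language and text-only tasks, including optical character recognition, multimodal reasoning, localization, common-sense reasoning, use of world knowledge, and coding.

Model Features

Multimodal capability

Supports vision-language and text-only tasks, with strong multimodal reasoning.

Strong performance

Achieves state-of-the-art results on vision-language tasks, rivaling leading models such as GPT-4o.

Improved text-only performance

After multimodal training, its text-only performance improves over that of its LLM backbone.

Model Capabilities

Optical character recognition

Multimodal reasoning

Localization

Common-sense reasoning

Use of world knowledge

Coding

Use Cases

Vision-language tasks

Image captioning

Generates a detailed text description of an input image.

Visual question answering

Answers questions about an input image.

Text-only tasks

Text generation

Generates coherent, context-appropriate text.

Common-sense reasoning

Performs logical reasoning grounded in common sense.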

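🚀 NVLM 1.0

NVLM 1.0 is a family of frontier-class multimodal large language models that achieves leading results on vision-language tasks, rivaling proprietary models such as GPT-4o and open-access models such as Llama 3-V 405B.

🚀 Quick Start

Today (September 17, 2024), we introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieves state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Notably, after multimodal training NVLM 1.0 shows improved text-only performance over its LLM backbone.

In this repository, we open-source the model weights and code of NVLM-1.0-D-72B (decoder-only architecture) for the community.

✨ Key Features

The models handle both vision-language and text-only tasks, including optical character recognition, multimodal reasoning, localization, common-sense reasoning, use of world knowledge, and coding.
After multimodal training, text-only performance improves over that of the backbone LLM.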

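📦 Installation

Preparing the Environment

We provide a Docker build file, named Dockerfile, for reproducing our experiments.

The Docker image is based on nvcr.io/nvidia/pytorch:23.09-py3.

⚠️ Important

We have observed that different Transformers, CUDA, and Docker versions can lead to slight differences in benchmark numbers. We recommend using the Dockerfile above for exact reproduction.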
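
Because small version differences can shift benchmark numbers, it may help to record the installed package versions alongside any results. The following is a minimal sketch and not part of the NVLM repository: the helper name `report_versions` and the package list are illustrative assumptions, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata
from typing import Dict, Iterable


def report_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Hypothetical package list; adjust it to whatever the Dockerfile installs.
    packages = ["torch", "transformers", "torchvision", "Pillow"]
    for name, version in report_versions(packages).items():
        print(f"{name}: {version}")
    # When PyTorch is installed, the CUDA version it was built against
    # is exposed as torch.version.cuda.
```

Saving this output next to the evaluation results makes it easier to explain small score differences between runs.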

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoModel

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True).eval()

Advanced Usage

Loading the Model on Multiple GPUs

import torch
import math
from transformers import AutoModel

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
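
To see how `split_model()` spreads the 80 decoder layers, the same arithmetic can be run without any GPU. This is a sketch under stated assumptions: `layers_per_gpu` and `assign_layers` are hypothetical helper names, the GPU count is passed in explicitly instead of being read from `torch.cuda.device_count()`, and, unlike the original, the loop stops once every real layer has been placed.

```python
import math
from typing import Dict, List


def layers_per_gpu(num_layers: int, world_size: int) -> List[int]:
    """Layer budget per GPU; GPU 0 also hosts the ViT, so it counts as half a GPU."""
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    return counts


def assign_layers(num_layers: int, world_size: int) -> Dict[int, int]:
    """Map each decoder layer index to a GPU index, mirroring split_model()."""
    device_of = {}
    layer = 0
    for gpu, count in enumerate(layers_per_gpu(num_layers, world_size)):
        for _ in range(count):
            if layer >= num_layers:  # assumption: skip slots beyond the last real layer
                break
            device_of[layer] = gpu
            layer += 1
    # As in split_model(), the final layer is pinned to GPU 0.
    device_of[num_layers - 1] = 0
    return device_of


if __name__ == "__main__":
    print(layers_per_gpu(80, 4))  # [12, 23, 23, 23]
```

With four GPUs the first device receives roughly half as many decoder layers as the others, leaving room for the vision encoder, the projector, and the embedding and output modules that `split_model()` also places on GPU 0.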

Inference Example

import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode


def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()

print(model)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(
    torch.bfloat16)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
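
The tiling step in `dynamic_preprocess` can be explored on its own. The sketch below re-implements only the grid selection and crop-box arithmetic in plain Python (no PIL or torch), so it shows which tile grid a given image size would receive. The names `candidate_grids`, `choose_grid`, and `tile_boxes` are hypothetical and do not appear in the original code.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows)


def candidate_grids(min_num: int = 1, max_num: int = 12) -> List[Grid]:
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    }
    return sorted(grids, key=lambda g: g[0] * g[1])


def choose_grid(width: int, height: int, max_num: int = 12, image_size: int = 448) -> Grid:
    """Pick the grid whose aspect ratio is closest to the image's aspect ratio."""
    aspect_ratio = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(1, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On a tie, a sufficiently large image gets the grid with more tiles.
            best = (cols, rows)
    return best


def tile_boxes(grid: Grid, image_size: int = 448) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, upper, right, lower) in row-major order."""
    cols, rows = grid
    return [
        ((i % cols) * image_size, (i // cols) * image_size,
         (i % cols + 1) * image_size, (i // cols + 1) * image_size)
        for i in range(cols * rows)
    ]


if __name__ == "__main__":
    grid = choose_grid(1920, 1080, max_num=6)
    print(grid, tile_boxes(grid))
```

For a 1920x1080 image with `max_num=6`, as in the single-image example, the closest grid is 2 columns by 1 row; with `use_thumbnail=True` the original `load_image` then appends one extra thumbnail tile, giving three 448x448 inputs.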

Benchmark Evaluation

python run_eval.py --config-path eval/full_eval.yaml \
 --result-save-path path/to/eval_results/ \
 --zero-shot-eval-tasks chartqa coco_caption flickr30k_caption vqav2 mmmu textvqa mathvista mmbench chartqa docvqa realworldqa ocrbench ai2diagram ai2diagram_nomask mmmu_pro docvqa_test

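Specifically:

--config-path eval/full_eval.yaml points to the file containing the evaluation configuration, including the evaluation prompts, the evaluation dataset paths, and the generation hyperparameters.
--result-save-path path/to/eval_results/ specifies the directory in which the evaluation results are saved.
--zero-shot-eval-tasks specifies the tasks to evaluate.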
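
When the evaluation is launched from a script, the same command line can be assembled programmatically. The sketch below is an illustration only: `build_eval_command` is a hypothetical helper, it uses just the three flags described above, and it drops repeated task names as a precaution (the source does not say how `run_eval.py` treats a task listed twice).

```python
from typing import Iterable, List


def build_eval_command(config_path: str, result_dir: str, tasks: Iterable[str]) -> List[str]:
    """Return the argv list for run_eval.py with duplicate tasks removed."""
    unique_tasks = list(dict.fromkeys(tasks))  # keeps first-seen order
    return [
        "python", "run_eval.py",
        "--config-path", config_path,
        "--result-save-path", result_dir,
        "--zero-shot-eval-tasks", *unique_tasks,
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        "eval/full_eval.yaml",
        "path/to/eval_results/",
        ["chartqa", "docvqa", "ocrbench", "chartqa"],
    )
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.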

📚 Detailed Documentation

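Benchmark Results

We trained the model with legacy Megatron-LM and adapted the codebase to Huggingface for model hosting, reproducibility, and inference.

We observe numerical differences between the Megatron and Huggingface codebases, but they are within the expected range of variation.

For reproducibility and comparison with other models, we report results from both the Huggingface codebase and the Megatron codebase.

Results on multimodal benchmarks as of September 17, 2024:

Vision-Language Benchmarks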

Benchmark	MMMU (val / test)	MathVista	OCRBench	AI2D	ChartQA	DocVQA	TextVQA	RealWorldQA	VQAv2
NVLM-D 1.0 72B (Huggingface)	58.7 / 54.9	65.2	852	94.2	86.0	92.6	82.6	69.5	85.4
NVLM-D 1.0 72B (Megatron)	59.7 / 54.6	65.2	853	94.2	86.0	92.6	82.1	69.7	85.4
Llama 3.2 90B	60.3 / -	57.3	-	92.3	85.5	90.1	-	-	78.1
Llama 3-V 70B	60.6 / -	-	-	93.0	83.2	92.2	83.4	-	79.1
Llama 3-V 405B	64.5 / -	-	-	94.1	85.8	92.6	84.8	-	80.2
InternVL2-Llama3-76B	55.2 / -	65.5	839	94.8	88.4	94.1	84.4	72.2	-
GPT-4V	56.8 / 55.7	49.9	645	78.2	78.5	88.4	78.0	61.4	77.2
GPT-4o	69.1 / -	63.8	736	94.2	85.7	92.8	-	-	-
Claude 3.5 Sonnet	68.3 / -	67.7	788	94.7	90.8	95.2	-	-	-
Gemini 1.5 Pro (Aug 2024)	62.2 / -	63.9	754	94.4	87.2	93.1	78.7	70.4	80.2
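
The size of the gap between the two codebases can be read directly off the first two rows of the table. As a small sketch, the snippet below subtracts the Megatron scores from the Huggingface scores per benchmark; the scores are copied from the table, while the variable and function names are illustrative. `Decimal` is used so that the differences are exact.

```python
from decimal import Decimal
from typing import Dict

BENCHMARKS = ["MMMU (val)", "MMMU (test)", "MathVista", "OCRBench", "AI2D",
              "ChartQA", "DocVQA", "TextVQA", "RealWorldQA", "VQAv2"]
# Scores of NVLM-D 1.0 72B copied from the two table rows above.
HUGGINGFACE = ["58.7", "54.9", "65.2", "852", "94.2", "86.0", "92.6", "82.6", "69.5", "85.4"]
MEGATRON = ["59.7", "54.6", "65.2", "853", "94.2", "86.0", "92.6", "82.1", "69.7", "85.4"]


def codebase_gaps() -> Dict[str, Decimal]:
    """Huggingface score minus Megatron score for each benchmark."""
    return {
        name: Decimal(hf) - Decimal(mg)
        for name, hf, mg in zip(BENCHMARKS, HUGGINGFACE, MEGATRON)
    }


if __name__ == "__main__":
    for name, gap in codebase_gaps().items():
        if gap != 0:
            print(f"{name}: {gap:+}")
```

Five of the ten columns are identical. The largest gap among the remaining columns is 1.0 point on the MMMU validation set; OCRBench is reported on a larger score scale than the other columns, so its 1-point gap is not directly comparable.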

Text-Only Benchmarks

Task	Backbone LLM	MMLU	GSM8K	MATH	HumanEval	Avg. Accuracy
Proprietary models
GPT-4.0	N/A	88.7	-	76.6	90.2	-
Gemini Pro 1.5 (Aug 2024)	N/A	85.9	90.8	67.7	84.1	82.1
Claude 3.5 Sonnet	N/A	88.7	96.4	71.1	92.0	87.0
Open LLMs
(a) Nous-Hermes-2-Yi-34B	N/A	75.5	78.6	21.8	43.3	54.8
(b) Qwen2-72B-Instruct	N/A	82.3	91.1	59.7	86.0	79.8
(c) Llama-3-70B-Instruct	N/A	82.0	93.0	51.0	81.7	76.6
(d) Llama-3.1-70B-Instruct	N/A	83.6	95.1	68.0	80.5	81.8
(e) Llama-3.1-405B-Instruct	N/A	87.3	96.8	73.8	89.0	86.7
Open multimodal LLMs
VILA-1.5 40B	(a)	73.3	67.5	16.8	34.1	🥶 47.9 (-6.9)
LLaVA-OneVision 72B	(b)	80.6	89.9	49.2	74.4	🥶 73.5 (-6.3)
InternVL-2-Llama3-76B	(c)	78.5	87.1	42.5	71.3	🥶 69.9 (-6.7)
*Llama 3-V 70B	(d)	83.6	95.1	68.0	80.5	🙂 81.8 (0)
*Llama 3-V 405B	(e)	87.3	96.8	73.8	89.0	🙂 86.7 (0)
NVLM-D 1.0 72B (Megatron)	(b)	82.0	92.9	73.1	88.4	🥳 84.1 (+4.3)
NVLM-D 1.0 72B (Huggingface)	(b)	81.7	93.2	73.1	89.0	🥳 84.3 (+4.5)
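
The last column and its bracketed delta can be sanity-checked by hand. The sketch below assumes, since the source does not state it, that the average is the plain mean of the four task scores rounded half-up to one decimal place, and that the delta is measured against the average of the backbone LLM. Under those assumptions it reproduces the NVLM-D 1.0 72B (Huggingface) row against backbone (b). The function name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Sequence


def average_accuracy(scores: Sequence[str]) -> Decimal:
    """Mean of the given scores, rounded half-up to one decimal place."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# MMLU, GSM8K, MATH, HumanEval scores copied from the table above.
nvlm_hf = average_accuracy(["81.7", "93.2", "73.1", "89.0"])
backbone_b = average_accuracy(["82.3", "91.1", "59.7", "86.0"])
delta = nvlm_hf - backbone_b

if __name__ == "__main__":
    print(nvlm_hf, backbone_b, delta)
```

`Decimal` is used here because the exact mean for the Huggingface row is 84.25, a value that binary floating-point rounding may not round the way the table does.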

Model Architecture

Property	Details
Network architecture	Decoder-only Transformer
Text-only LLM backbone	Qwen2-72B-Instruct
Vision encoder	InternViT-6B
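Robustness

A model trained on this dataset cannot regenerate its training data:

Because the model outputs only text, it has no image-generation capability and cannot regenerate any image it saw during training.
The model cannot regenerate its training text data: during training, the model takes both text and images as input, and its output (text) is conditioned on both. At inference time, without the training images as input, the model cannot reproduce any part of the training text data.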

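Input

Input type(s): Text, image
Input format(s): String, formats supported by the Pillow library
Input dimensions: One-dimensional (1D), two-dimensional (2D)
Other input-related properties: Maximum token length = 128K tokens

Output

Output type(s): Text
Output format: String
Output dimensions: 1D
Other output-related properties: None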

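Software Integration

Property	Details
Runtime engine	PyTorch
Supported hardware microarchitecture compatibility	NVIDIA Hopper
Preferred/supported operating system(s)	Linux

Inference

Property	Details
Engine	PyTorch
Test hardware	H100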

Model Version

v1.0-D (NVLM-D)

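Training, Testing, and Evaluation Datasets

Pre-Training Dataset

Link: see Table 4
Data collection method: Hybrid: automated, human, synthetic, unknown
Labeling method: Hybrid: automated, human, synthetic, unknown
Properties: Trained on image captions, image-text pairs, natural images, charts, documents, scene descriptions, and mathematical reasoning.

Supervised Fine-Tuning Dataset

Link: see Table 6
Data collection method: Hybrid: automated, human, synthetic, unknown
Labeling method: Hybrid: automated, human, synthetic, unknown
Properties: Trained on image captions; general knowledge; image-text pairs; natural images; charts; diagrams; documents; scene descriptions; science diagrams, lessons, textbook data, and question-answer pairs; visual instruction tuning; and mathematical reasoning.

Evaluation Dataset

Link: see Section 6.1, "Benchmarks"
Data collection method: Human
Labeling method: Human
Properties: Evaluated on general knowledge, visual question answering, chart understanding, tables, optical character recognition, and mathematical reasoning.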

Contact

Wenliang Dai* (wdai@nvidia.com), Nayeon Lee* (nayeonl@nvidia.com), Boxin Wang* (boxinw@nvidia.com), Zhuolin Yang* (zhuoliny@nvidia.com), Wei Ping* (wping@nvidia.com)

*Equal contribution

Citation

@article{nvlm2024,
  title={NVLM: Open Frontier-Class Multimodal LLMs},
  author={Dai, Wenliang and Lee, Nayeon and Wang, Boxin and Yang, Zhuolin and Liu, Zihan and Barker, Jon and Rintamaki, Tuomas and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}