InternVL3-38B: Open-Source Multimodal Large Language Model with Strong Perception and Reasoning and Extended Multimodal Capabilities

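InternVL3 38B

Developed by FriendliAI

InternVL3-38B is an advanced multimodal large language model with strong multimodal perception and reasoning. It improves significantly on its predecessor and extends its multimodal capabilities to tool use, GUI agents, and more.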

Image-Text-to-Text

Transformers

License: Other #Multimodal reasoning #Tool-use agent #Dynamic resolution processing

Downloads: 166

Released: 4/12/2025

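Model Overview

InternVL3-38B is a multimodal large language model with strong multimodal perception and reasoning capabilities. It supports application scenarios such as tool use and GUI agents.

Model Features

Advanced multimodal capabilities

Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and more.

Strong language performance

Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 Chat models.

Flexible model architecture

Follows the "ViT-MLP-LLM" paradigm, integrating a newly incrementally pre-trained InternViT with various pre-trained large language models, such as InternLM 3 and Qwen 2.5.

Efficient training strategy

Introduces a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage; uses high-quality, diverse training data during supervised fine-tuning; and applies Mixed Preference Optimization (MPO) to improve reasoning performance.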

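Model Capabilities

Multimodal perception

Multimodal reasoning

Tool use

GUI agents

Industrial image analysis

3D vision perception

Text generation

Image analysis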

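Use Cases

Multimodal reasoning

Multimodal reasoning tasks

Performs well on multiple multimodal reasoning benchmarks.

InternVL3-38B fine-tuned with MPO scores 4.5 points higher than its counterpart trained without MPO.

GUI operation

GUI agents

Supports GUI operation tasks.

Industrial image analysis

Supports industrial image analysis tasks.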

🚀 InternVL3-38B

【GitHub】【InternVL 1.0】【InternVL 1.5】【InternVL 2.5】【InternVL2.5-MPO】【InternVL3】

【Blog】【Chat Demo】【HF Demo】【Quick Start】【Documents】

🚀 Quick Start

We provide example code for running InternVL3-38B with transformers.

⚠️ Important

Please use transformers>=4.37.2 to ensure the model works properly.
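
The requirement above can be checked before loading the model. The following is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers (not part of transformers), and they assume plain numeric release versions such as `4.37.2` rather than pre-release tags.

```python
def parse_version(version):
    """Turn '4.37.2' into (4, 37, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(installed, minimum="4.37.2"):
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)
```

The installed version string can be obtained with `importlib.metadata.version("transformers")`. Comparing tuples rather than raw strings matters here: as strings, "4.9.0" would sort after "4.37.2".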

Model Loading

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-38B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-38B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

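Multiple GPUs

The code below is written this way to avoid errors during multi-GPU inference caused by tensors not being on the same device. Keeping the first and last layers of the large language model (LLM) on the same device prevents such errors.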

import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    """Build a device_map that spreads the LLM layers across all visible GPUs."""
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the ViT, the MLP projector, the embeddings, the final norm, and the output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Pin the last LLM layer to GPU 0 so the first and last layers share a device.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-38B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
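
To see what the allocation rule in split_model produces, the sketch below reproduces the same arithmetic without loading a model or touching any GPU. `plan_layers` is a hypothetical helper, and the 64-layer, 4-GPU figures used in the usage note are illustrative assumptions rather than values read from the model config. Unlike split_model, it stops assigning once every real layer has been placed.

```python
import math
from collections import Counter


def plan_layers(num_layers, world_size):
    """Map each LLM layer index to a GPU index using the split_model rule.

    GPU 0 also hosts the ViT, so it is treated as half a GPU and receives
    about half as many LLM layers as each of the other GPUs.
    """
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)

    layer_to_gpu = {}
    layer = 0
    for gpu, count in enumerate(counts):
        for _ in range(count):
            if layer < num_layers:  # stop once every real layer is placed
                layer_to_gpu[layer] = gpu
                layer += 1
    # The last layer is pinned to GPU 0, next to the embeddings and output head.
    layer_to_gpu[num_layers - 1] = 0
    return layer_to_gpu


def layers_per_gpu(layer_to_gpu):
    """Count how many layers end up on each GPU."""
    return dict(Counter(layer_to_gpu.values()))
```

For example, `layers_per_gpu(plan_layers(64, 4))` gives 11 layers on GPU 0 (layers 0 to 9 plus the pinned last layer), 19 each on GPUs 1 and 2, and 15 on GPU 3.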

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # calculate the existing image aspect ratio
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (cols, rows) with min_num <= cols * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

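📦 Model Information

Property	Details
Model type	Multimodal large language model
Base models	OpenGVLab/InternViT-6B-448px-V2_5, Qwen/Qwen2.5-32B
Base model relation	Merge
Training dataset	OpenGVLab/MMPR-v1.2
Supported languages	Multilingual
Tags	internvl, custom_code
License	qwen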

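📚 Detailed Documentation

Model Architecture

As shown in the figure below, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we use a randomly initialized MLP projector to integrate a newly incrementally pre-trained InternViT with various pre-trained large language models, including InternLM 3 and Qwen 2.5.

Model architecture diagram

As in previous versions, we apply a pixel unshuffle operation, which reduces the number of visual tokens to one quarter of the original. In addition, we adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we also introduced support for multi-image and video data.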
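
The token reduction described above can be illustrated with a small pure-Python sketch. It is not the model's implementation: `pixel_unshuffle` and `tokens_per_tile` are hypothetical helpers, the patch size of 14 is an assumption not stated in this document, and the order in which neighbouring features are concatenated is chosen only for illustration.

```python
def pixel_unshuffle(grid, r=2):
    """Merge each r x r neighbourhood of token vectors into one longer vector.

    `grid` is a list of rows, each row a list of feature vectors (lists).
    The output has 1 / r**2 as many tokens, each r**2 times as wide.
    """
    height, width = len(grid), len(grid[0])
    assert height % r == 0 and width % r == 0, "grid must be divisible by r"
    merged_grid = []
    for i in range(0, height, r):
        merged_row = []
        for j in range(0, width, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            merged_row.append(merged)
        merged_grid.append(merged_row)
    return merged_grid


def tokens_per_tile(image_size=448, patch_size=14, r=2):
    """Visual tokens per tile after unshuffling (patch_size=14 is an assumption)."""
    patches_per_side = image_size // patch_size
    return (patches_per_side // r) ** 2
```

Under these assumptions a 448×448 tile yields 32 × 32 = 1024 patch tokens before the operation and `tokens_per_tile()` = 256 afterwards, which matches the one-quarter reduction described in the text.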

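Notably, in InternVL3 we integrate Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 exhibits better long-context understanding than its predecessors.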
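
The idea of a smaller position increment for visual tokens can be sketched as follows. This is only an illustration of the principle, not the V2PE implementation: `position_ids` is a hypothetical helper, and the increment of 0.25 and the token-type labels are assumptions made for the example.

```python
def position_ids(token_types, visual_delta=0.25):
    """Assign a position to every token in order.

    Text tokens advance the position counter by 1, while visual tokens
    advance it by `visual_delta` (< 1), so long runs of visual tokens
    consume less of the positional range.
    """
    positions = []
    current = 0.0
    for kind in token_types:
        positions.append(current)
        current += visual_delta if kind == "image" else 1.0
    return positions
```

With `["text", "image", "image", "image", "image", "text"]`, the four visual tokens together span one position unit, so the final text token sits at position 2.0 instead of 5.0.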

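Training Strategy

Native Multimodal Pre-Training

We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In contrast to the standard paradigm, which first trains a language-only model and then adapts it to handle additional modalities, our method interleaves multimodal data (such as image-text, video-text, or image-text interleaved sequences) with large-scale text corpora. This unified training scheme lets the model learn linguistic and multimodal representations at the same time, ultimately strengthening its ability to handle vision-language tasks without separate alignment or bridging modules. Please see our paper for more details.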

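Supervised Fine-Tuning

In this stage, the random JPEG compression, square loss re-weighting, and multimodal data packing techniques proposed in InternVL2.5 are also applied to the InternVL3 series. The main advance of InternVL3 over InternVL2.5 in the supervised fine-tuning (SFT) stage is the use of higher-quality and more diverse training data. Specifically, we further extend the training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.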
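
As a rough illustration of square loss re-weighting, the sketch below assumes the formulation in which each token's loss weight is 1 / x^0.5, where x is the number of tokens in the response, sitting between per-token averaging (exponent 0) and per-sample averaging (exponent 1). That formulation is not spelled out in this document, so treat it as an assumption; `token_weight` is a hypothetical helper.

```python
# Exponent applied to the response length x when weighting each token's loss.
# These scheme names are labels chosen for this sketch.
EXPONENTS = {"token": 0.0, "square": 0.5, "sample": 1.0}


def token_weight(num_response_tokens, scheme="square"):
    """Per-token loss weight 1 / x**k for a response with x tokens."""
    if num_response_tokens <= 0:
        raise ValueError("a response must contain at least one token")
    return 1.0 / (num_response_tokens ** EXPONENTS[scheme])
```

For a 16-token response, the weights are 1.0 ("token"), 0.25 ("square"), and 0.0625 ("sample"), so the square scheme dampens the influence of long responses without fully equalizing all samples.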

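Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on the previous ground-truth tokens. During inference, however, the model predicts each token conditioned on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the training objective of MPO is a combination of a preference loss $\mathcal{L}_{\text{p}}$, a quality loss $\mathcal{L}_{\text{q}}$, and a generation loss $\mathcal{L}_{\text{g}}$, which can be formulated as:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}} $$

where $w_{*}$ denotes the weight assigned to each loss component. Please see our paper for more details about MPO.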
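
The weighted combination can be written out directly. In this sketch the three component losses are taken as already-computed numbers, and the default weights are placeholder values chosen for illustration, not values given in this document; `mpo_loss` is a hypothetical helper.

```python
def mpo_loss(pref_loss, quality_loss, gen_loss, w_p=0.8, w_q=0.2, w_g=1.0):
    """Combine preference, quality, and generation losses into one objective.

    Implements L = w_p * L_p + w_q * L_q + w_g * L_g. The default weights
    are illustrative placeholders only.
    """
    return w_p * pref_loss + w_q * quality_loss + w_g * gen_loss
```

For instance, with component losses 1.0, 2.0, and 0.5 and the placeholder weights, the total is 0.8 + 0.4 + 0.5 = 1.7.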

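Test-Time Scaling

Test-time scaling has been shown to be an effective way to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response for reasoning and mathematics evaluation.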
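
Best-of-N selection itself is simple: sample N candidate responses, score each with a critic, and keep the highest-scoring one. The sketch below shows only that selection step. `best_of_n` is a hypothetical helper, and `score_fn` stands in for a critic model; how VisualPRM-8B is actually invoked is not covered here.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest critic score.

    `candidates` is a non-empty list of responses and `score_fn` maps a
    response to a number. On ties, the earliest candidate is kept.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=score_fn)


# Toy usage with made-up scores standing in for a critic model's output.
toy_scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}
best = best_of_n(list(toy_scores), toy_scores.get)
```

Here `best` is "answer B". In practice the quality of the result depends on how well the critic's scores track correctness.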

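🔧 Technical Details

Native Multimodal Pre-Training

We conducted experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B is trained with an MLP warmup stage for feature alignment, followed by an instruction tuning stage. In our experiments, we replaced the conventional MLP warmup stage with native multimodal pre-training. This modification isolates the contribution of native multimodal pre-training to the model's overall multimodal capability.

The evaluation results in the figure below show that the model with native multimodal pre-training performs comparably to the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, after instruction tuning on higher-quality data, the model shows further gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in giving MLLMs strong multimodal capabilities.

Native multimodal pre-training evaluation results

Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO achieve better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of the data used for SFT, which indicates that the improvement comes primarily from the training algorithm rather than from the training data.

Mixed Preference Optimization evaluation results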