InternVL3-1B: Open-Source Multimodal Large Language Model - Free Deployment with Strong Perception and Reasoning

Internvl3 1B

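Developed by FriendliAI

InternVL3-1B is the 1B-parameter multimodal large language model in the InternVL3 series. It combines the InternViT vision encoder with the Qwen2.5 language model and offers strong multimodal perception and reasoning.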

Transformers

License: Other #Multimodal large language model #Native multimodal pre-training #Long-context understanding

Downloads: 71

Released: 4/12/2025

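Model Overview

InternVL3-1B is a multimodal large language model that combines vision and language processing. It accepts image, video, and text input and targets complex multimodal understanding and generation tasks.

Model Features

Native multimodal pre-training

Language and vision learning are merged into a single pre-training stage, which strengthens performance on multimodal tasks.

Variable Visual Position Encoding (V2PE)

Visual tokens advance the position index by smaller, more flexible increments, which improves long-context understanding.

Mixed Preference Optimization (MPO)

Supervision from positive and negative samples aligns the model's response distribution and improves reasoning performance.

Dynamic resolution strategy

Images are split into 448×448-pixel tiles; multi-image and video data are supported.

Model Capabilities

Multimodal reasoning

Image understanding

Video understanding

Text generation

OCR

Chart understanding

Document understanding

GUI grounding

Spatial reasoning

Use Cases

Industrial image analysis

Industrial defect detection

Identifies defects in industrial products through image analysis.

Outcome: high-accuracy defect identification and higher production efficiency.

3D visual perception

3D scene understanding

Analyzes objects and spatial relationships in 3D scenes.

Outcome: accurate understanding of complex 3D scenes.

Tool use

Automated tool operation

Operates tools through natural-language instructions.

Outcome: more convenient and efficient tool use.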

🚀 InternVL3-1B

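InternVL3-1B belongs to InternVL3, an advanced multimodal large language model (MLLM) series. Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D visual perception, and more.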

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

🚀 Quick Start

We provide example code for running InternVL3-1B with transformers.

⚠️ Important

Use transformers>=4.37.2 to ensure the model works correctly.

Model Loading

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

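Multiple GPUs

The code below is written to avoid errors during multi-GPU inference caused by tensors not being on the same device. Keeping the first and last layers of the large language model (LLM) on the same device prevents such errors.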

import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    # Build a device map that spreads the LLM layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Vision encoder, projector, embeddings, and output head all stay on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Keep the last LLM layer on the same device as the first one.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-1B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
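
The layer split computed by split_model can be inspected without loading any model. The following is a minimal sketch, not the authors' code: plan_layers is a hypothetical helper that mirrors the arithmetic above using only the standard library, and it additionally stops at num_layers so that no nonexistent layer index is assigned. The values num_layers=24 and world_size=2 are illustrative assumptions.

```python
import math

def plan_layers(num_layers, world_size):
    """Return a list where entry k is the GPU index assigned to LLM layer k."""
    # GPU 0 also hosts the vision encoder, so it counts as half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    budget = [per_gpu] * world_size
    budget[0] = math.ceil(per_gpu * 0.5)
    placement = []
    for gpu, count in enumerate(budget):
        for _ in range(count):
            # Assumption of this sketch: never assign more layers than exist.
            if len(placement) < num_layers:
                placement.append(gpu)
    # The last layer is moved back to GPU 0, next to the embeddings and head.
    placement[-1] = 0
    return placement

plan = plan_layers(num_layers=24, world_size=2)
print(plan)
```

With these inputs, GPU 0 receives the first 8 layers plus the final layer and GPU 1 receives the 15 layers in between, which reflects the "half a GPU" budget reserved for the vision encoder.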

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    # Build a device map that spreads the LLM layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Vision encoder, projector, embeddings, and output head all stay on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Keep the last LLM layer on the same device as the first one.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# The multi-GPU device map is optional for this checkpoint: at roughly 1B parameters,
# the bf16 weights take on the order of 2 GB and fit on a single GPU.
path = 'OpenGVLab/InternVL3-1B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

Streaming Output

Besides the methods above, you can stream the output with the following code.

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

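✨ Key Features

Strong multimodal capabilities: Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and extends to tool use, GUI agents, industrial image analysis, 3D visual perception, and more.
Excellent language performance: Thanks to native multimodal pre-training, the InternVL3 series even outperforms the Qwen2.5 series in overall text performance.
Integrated new techniques: InternVL3 integrates Variable Visual Position Encoding (V2PE), which gives it better long-context understanding.

📦 Installation Guide

LMDeploy Deployment

LMDeploy is a toolkit for compressing, deploying, and serving large language models (LLMs) and vision-language models (VLMs).

# With lmdeploy<0.7.3, you need to set chat_template_config=ChatTemplateConfig(model_name='internvl2_5') explicitly.
pip install "lmdeploy>=0.7.3"

LMDeploy abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the inference pipeline for large language models (LLMs).

A "Hello, world" Example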

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)

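If an ImportError occurs while running this example, install the required dependency packages as prompted.

Multi-Image Inference

When handling multiple images, put them in a list. Keep in mind that multiple images produce more input tokens, so the context window usually needs to be enlarged.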

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering the images helps in multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

Batch Prompt Inference

Batch prompt inference is straightforward: place the prompts in a list:

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

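Multi-Turn Conversation

There are two ways to run a multi-turn conversation with the pipeline. One is to build messages in the OpenAI format and use the method introduced above; the other is to use the pipeline.chat interface.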

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
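
The code above covers the pipeline.chat route. For the other route, OpenAI-format messages, a minimal sketch of how the message list could be assembled is given below. The helpers user_turn and assistant_turn are hypothetical names introduced for illustration, and the assistant reply text is a placeholder; the dictionary layout follows the message structure used in the service deployment example of this document.

```python
def user_turn(text, image_url=None):
    """Build one user message; the image part is optional."""
    content = [{'type': 'text', 'text': text}]
    if image_url is not None:
        content.append({'type': 'image_url', 'image_url': {'url': image_url}})
    return {'role': 'user', 'content': content}

def assistant_turn(text):
    """Build one assistant message carrying an earlier reply."""
    return {'role': 'assistant', 'content': text}

image_url = 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'

# First round: question plus image.
messages = [user_turn('describe this image', image_url)]
# Later rounds: append the model's earlier reply (placeholder here), then the follow-up.
messages.append(assistant_turn('<first reply text>'))
messages.append(user_turn('What is the woman doing?'))
```

The full history is resent on every round in this style, so the context window has to hold the image tokens plus all earlier turns.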

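Service Deployment

LMDeploy's api_server packages the model as a service with a single command. The RESTful API it provides is compatible with OpenAI's interface. An example of starting the service:

lmdeploy serve api_server OpenGVLab/InternVL3-1B --chat-template internvl2_5 --server-port 23333 --tp 1

To use the OpenAI-style interface, install the OpenAI package:

pip install openai

Then make API calls with the following code: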

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

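📚 Detailed Documentation

Model Architecture

As shown in the figure below, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained large language models (LLMs), including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

[Figure: Model architecture]

As in previous versions, we apply a pixel unshuffle operation that reduces the number of visual tokens to one quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, splitting images into 448×448-pixel tiles. The key difference, starting from InternVL 2.0, is that we additionally support multi-image and video data.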
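
To make the token arithmetic concrete, here is a rough sketch of how many visual tokens a given number of tiles would produce. The 448-pixel tile size and the reduction to one quarter come from the text above; the ViT patch size of 14 is an assumption not stated in this document, and visual_tokens is a hypothetical helper.

```python
def visual_tokens(num_tiles, tile_size=448, patch_size=14, unshuffle_factor=4):
    """Estimate visual tokens after pixel unshuffle.

    patch_size=14 is an assumed ViT patch size; unshuffle_factor=4 reflects
    the reduction of visual tokens to one quarter described in the text.
    """
    patches_per_tile = (tile_size // patch_size) ** 2
    return num_tiles * patches_per_tile // unshuffle_factor

print(visual_tokens(1))   # one tile
print(visual_tokens(13))  # 12 tiles plus one thumbnail, as with max_num=12
```

Under these assumptions a single tile costs 256 tokens and a 12-tile image with a thumbnail costs 3328, which is why multi-image and video inputs call for a larger context window.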

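Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 exhibits better long-context understanding than its predecessors.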
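
The idea of a smaller position increment for visual tokens can be sketched as follows. This is an illustration under stated assumptions rather than the model's implementation: position_ids is a hypothetical helper, the rule that each token advances the index by 1 (text) or by delta (visual) is one plausible reading of the description above, and delta=0.25 is an arbitrary example value.

```python
def position_ids(is_visual, delta=0.25):
    """Assign a position to each token; visual tokens advance by delta, text by 1."""
    positions = []
    current = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            # Assumed rule: the step size depends on the type of the current token.
            current += delta if visual else 1.0
        positions.append(current)
    return positions

# One text token, four visual tokens, one text token.
tokens = [False, True, True, True, True, False]
print(position_ids(tokens))             # fractional steps for visual tokens
print(position_ids(tokens, delta=1.0))  # conventional encoding for comparison
```

With delta below 1, a long run of visual tokens consumes far less of the positional range than the same number of text tokens, which is the property the text credits for better long-context behavior.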

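Training Strategy

Native Multimodal Pre-Training

We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. Unlike the standard paradigm, which first trains a language-only model and then adapts it to handle additional modalities, our method interleaves multimodal data (such as image-text, video-text, or interleaved image-text sequences) with large-scale text corpora. This unified training scheme lets the model learn linguistic and multimodal representations simultaneously, which ultimately improves its ability to handle vision-language tasks without separate alignment or bridging modules. See our paper for more details.

Supervised Fine-Tuning

In this stage, the random JPEG compression, square loss re-weighting, and multimodal data packing techniques proposed in InternVL2.5 are also applied to the InternVL3 series. The main advance of the supervised fine-tuning (SFT) stage in InternVL3 over InternVL2.5 is the use of higher-quality and more diverse training data. Specifically, we further extend the training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.

Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. During inference, however, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift that can impair the model's Chain-of-Thought (CoT) reasoning. To mitigate this, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the MPO training objective combines a preference loss $\mathcal{L}_{\text{p}}$, a quality loss $\mathcal{L}_{\text{q}}$, and a generation loss $\mathcal{L}_{\text{g}}$:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where $w_{*}$ denotes the weight assigned to each loss component. See our paper for more details about MPO.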
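
As a small numeric illustration of the weighted sum in the MPO objective, the sketch below combines three scalar loss values. The function name mpo_loss, the loss values, and the weights are all hypothetical; the document does not state the weights actually used.

```python
def mpo_loss(pref, quality, gen, w_p=1.0, w_q=1.0, w_g=1.0):
    """Weighted sum of preference, quality, and generation losses."""
    return w_p * pref + w_q * quality + w_g * gen

# Illustrative numbers only.
total = mpo_loss(pref=0.5, quality=0.25, gen=2.0, w_p=0.5, w_q=0.25, w_g=1.0)
print(total)
```

In a real training loop each term would be a tensor computed from a batch of preferred and rejected responses; only the final combination is shown here.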

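Test-Time Scaling

Test-time scaling has been shown to be an effective way to improve the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.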
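
Best-of-N selection itself is simple: sample N candidate responses, score each with a critic, and keep the highest-scoring one. The sketch below shows only that selection step. best_of_n and toy_score are hypothetical names, and the toy scorer (response length) merely stands in for a real critic model such as the one named above.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest critic score (first one wins ties)."""
    if not candidates:
        raise ValueError('need at least one candidate response')
    return max(candidates, key=score_fn)

def toy_score(response):
    # Stand-in critic for illustration: longer answers score higher.
    return len(response)

candidates = ['x = 3', 'x = 4 because 2 + 2 = 4', 'unknown']
best = best_of_n(candidates, toy_score)
print(best)
```

In practice the candidates would come from N sampled generations for the same prompt, and score_fn would call the critic model on each prompt-response pair.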

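Evaluation

Multimodal Capability Evaluation

Multimodal reasoning and mathematics: reports performance on multimodal reasoning and mathematics tasks.
OCR, chart, and document understanding: evaluates understanding of OCR, charts, and documents.
Multi-image and real-world understanding: tests understanding of multi-image input and real-world scenes.
Comprehensive multimodal and hallucination evaluation: assesses overall multimodal capability and hallucination behavior.
Visual grounding: evaluates visual grounding ability.
Multimodal multilingual understanding: tests multimodal multilingual understanding.
Video understanding: evaluates video understanding ability.
GUI grounding: tests GUI grounding ability.
Spatial reasoning: evaluates spatial reasoning ability.

Language Capability Evaluation

We compare InternVL3 with the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Thanks to native multimodal pre-training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, because we use the prompt versions listed in the table for OpenCompass evaluation on all datasets.

[Table: Language capability evaluation]

Ablation Study

Native Multimodal Pre-Training

We ran experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B uses a training pipeline that begins with an MLP warmup stage for feature alignment, followed by an instruction tuning stage. In our experiments, we replace the conventional MLP warmup stage with native multimodal pre-training. This modification isolates the contribution of native multimodal pre-training to the model's overall multimodal capability.

The evaluation results in the figure below show that the model with native multimodal pre-training performs comparably to the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, after instruction tuning on higher-quality data, the model shows further gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in giving MLLMs strong multimodal capabilities.

[Figure: Native multimodal pre-training ablation]

Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO show better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, which indicates that the gains stem primarily from the training algorithm rather than the training data.

[Table: Mixed preference optimization ablation]

Variable Visual Position Encoding

As shown in the table below, introducing V2PE leads to significant gains on most evaluation metrics. In addition, our ablation studies, which vary the position increment $\delta$, show that even for tasks that mainly involve conventional contexts, relatively small $\delta$ values achieve the best performance. These findings offer useful insight for future work on position encoding strategies for visual tokens in MLLMs.

[Table: Variable visual position encoding ablation]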

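🔧 Technical Details

Model Information

Property	Details
Model type	Multimodal large language model (MLLM)
Base models	OpenGVLab/InternViT-300M-448px-V2_5, Qwen/Qwen2.5-0.5B, etc.
Base model relation	Merge
Training dataset	OpenGVLab/MMPR-v1.2
Supported languages	Multilingual
Tags	internvl, custom_code

Fine-Tuning Support

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, and XTuner. Refer to their documentation for details on fine-tuning.

📄 License

This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.

Citation

If you find this project useful in your research, please consider citing: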

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
