InternVL3-8B-Instruct-GGUF: open-source multimodal model, free to use, with strong perception and reasoning

Internvl3 8B Instruct GGUF

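Published by unsloth

InternVL3-8B-Instruct is an advanced multimodal large language model (MLLM) with strong overall performance and robust multimodal perception and reasoning capabilities.

Image-Text-to-Text

Transformers

License: Apache-2.0 #multimodal-reasoning #native-pretraining #long-context-understanding

Downloads: 2,412

Released: 5/19/2025

Model Overview

InternVL3-8B-Instruct is the SFT version of the InternVL3 series: it has gone through native multimodal pre-training and SFT, but not MPO. The model supports multimodal tasks including tool use, GUI agents, industrial image analysis, and 3D visual perception.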

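Model Features

Native multimodal pre-training

Merges language and vision learning into a single pre-training stage, strengthening the model's multimodal processing.

Variable Visual Position Encoding (V2PE)

Uses smaller, more flexible position increments for visual tokens, improving long-context understanding.

Extended multimodal capabilities

Supports tool use, GUI agents, industrial image analysis, 3D visual perception, and other tasks.

High-performance reasoning

Shows strong multimodal reasoning and mathematical ability across multiple benchmarks.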

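Model Capabilities

Multimodal reasoning

OCR

Chart and document understanding

Multi-image and real-world understanding

Visual grounding

Multimodal multilingual understanding

Video understanding

GUI grounding

Spatial reasoning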

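Use Cases

Industrial applications

Industrial image analysis

Analyzes images from industrial settings to identify defects or anomalies.

Education

Scientific chart understanding

Helps students understand and analyze the information in scientific charts.

Entertainment

Video content understanding

Analyzes video content to generate descriptions or answer related questions.

🚀 InternVL3-8B-Instruct

InternVL3-8B-Instruct is an advanced multimodal large language model that excels at multimodal perception and reasoning and extends multimodal capabilities to applications such as tool use, GUI agents, and industrial image analysis.

🚀 Quick Start

Model Loading

16-bit (bf16 / fp16)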

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

Multiple GPUs

import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    """Build a device map that spreads the LLM layers across all visible GPUs."""
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # The last decoder layer also goes on GPU 0, next to the final norm and head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-8B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
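The layer arithmetic in `split_model` can be checked without downloading anything. The sketch below reproduces only the counting logic in plain Python; the helper names `layers_per_gpu` and `assign_layers` are hypothetical, and the 28-layer figure is an assumption used for illustration (read the real value from `config.llm_config.num_hidden_layers`).

```python
import math

def layers_per_gpu(num_layers, world_size):
    """Layer budget per GPU; GPU 0 counts as half a GPU because it hosts the ViT."""
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    return counts

def assign_layers(num_layers, world_size):
    """Map each decoder layer index to a GPU index, mirroring the loop in split_model."""
    assignment = {}
    layer = 0
    for gpu, count in enumerate(layers_per_gpu(num_layers, world_size)):
        for _ in range(count):
            if layer < num_layers:  # skip budget slots beyond the last real layer
                assignment[layer] = gpu
            layer += 1
    # The last layer is pinned to GPU 0, next to the final norm and output head.
    assignment[num_layers - 1] = 0
    return assignment

# Illustrative values only: 28 decoder layers split over 2 GPUs.
print(layers_per_gpu(28, 2))        # [10, 19]
print(assign_layers(28, 2)[10])     # 1
```

With two GPUs, the first takes layers 0 to 9 plus the vision encoder, the second takes layers 10 to 26, and the final layer returns to GPU 0.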

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    """Build a device map that spreads the LLM layers across all visible GPUs."""
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # The last decoder layer also goes on GPU 0, next to the final norm and head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# In bf16 the 8B weights alone take roughly 16 GB; `load_in_8bit=True` roughly halves that.
# With a single large GPU you can drop `device_map` and call `.cuda()` instead.
path = 'OpenGVLab/InternVL3-8B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

Streaming Output

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

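✨ Key Features

Advanced multimodal capabilities: Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D visual perception, and more.
Strong language performance: Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series.
Flexible model architecture: Retains the "ViT-MLP-LLM" paradigm, integrating a newly incrementally pre-trained InternViT with various pre-trained LLMs, such as InternLM 3 and Qwen 2.5.
Innovative training strategies: Combines native multimodal pre-training, supervised fine-tuning, mixed preference optimization, and test-time scaling to improve model performance.

📦 Installation

LMDeploy

# if lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
# quote the requirement so the shell does not treat ">" as a redirect
pip install "lmdeploy>=0.7.3"

To use the OpenAI-style interface, install the OpenAI package:

pip install openai

💻 Usage Examples

Basic Usage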

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-8B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)

Advanced Usage

Multi-Image Inference

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-8B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

Batch Prompt Inference

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-8B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

Multi-Turn Conversation

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-8B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

Service Deployment

lmdeploy serve api_server OpenGVLab/InternVL3-8B --chat-template internvl2_5 --server-port 23333 --tp 1

Call the API through the OpenAI-style interface:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

📚 Detailed Documentation

InternVL3 Family

Model Name	Vision Part	Language Part	Hugging Face Link
InternVL3-1B	[InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5)	[Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)	[link](https://huggingface.co/OpenGVLab/InternVL3-1B)
InternVL3-2B	[InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5)	[Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B)	[link](https://huggingface.co/OpenGVLab/InternVL3-2B)
InternVL3-8B	[InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5)	[Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)	[link](https://huggingface.co/OpenGVLab/InternVL3-8B)
InternVL3-9B	[InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5)	[internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct)	[link](https://huggingface.co/OpenGVLab/InternVL3-9B)
InternVL3-14B	[InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5)	[Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B)	[link](https://huggingface.co/OpenGVLab/InternVL3-14B)
InternVL3-38B	[InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)	[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B)	[link](https://huggingface.co/OpenGVLab/InternVL3-38B)
InternVL3-78B	[InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)	[Qwen2.5-72B](https://huggingface.co/Qwen/Qwen2.5-72B)	[link](https://huggingface.co/OpenGVLab/InternVL3-78B)

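Model Architecture

[InternVL3](https://internvl.github.io/blog/2025-04-11-InternVL-3/) retains the model architecture of [InternVL 2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/) and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, a randomly initialized MLP projector integrates a newly incrementally pre-trained InternViT with various pre-trained LLMs.

A pixel unshuffle operation reduces the number of visual tokens to one quarter of the original, and a dynamic resolution strategy similar to that of InternVL 1.5 divides images into tiles of 448×448 pixels. Since InternVL 2.0, multi-image and video data have also been supported. In addition, InternVL3 integrates Variable Visual Position Encoding (V2PE), which gives the model better long-context understanding than its predecessors.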
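To make the "one quarter" figure concrete, the following sketch counts visual tokens per tile. The patch size of 14 and the downsample ratio of 0.5 per side are assumptions taken as typical InternViT settings, not values stated in this document; the function names are hypothetical.

```python
def visual_tokens_per_tile(image_size=448, patch_size=14, downsample=0.5):
    """Tokens the LLM sees for one tile after pixel unshuffle (assumed settings)."""
    patches_per_side = image_size // patch_size            # 448 // 14 = 32
    tokens_per_side = int(patches_per_side * downsample)   # 32 * 0.5 = 16
    return tokens_per_side ** 2                            # 16 * 16 = 256

def visual_tokens(num_tiles, use_thumbnail=True):
    """Total visual tokens for an image split into num_tiles tiles.

    A thumbnail tile is appended only when the image has more than one tile,
    matching the `use_thumbnail` branch in `dynamic_preprocess`.
    """
    extra = 1 if use_thumbnail and num_tiles > 1 else 0
    return (num_tiles + extra) * visual_tokens_per_tile()

print(visual_tokens_per_tile())  # 256, a quarter of the 1024 raw patches
print(visual_tokens(12))         # 3328 for 12 tiles plus a thumbnail
```

Under these assumptions, the default `max_num=12` setting costs up to 3328 visual tokens per image, which is worth keeping in mind when choosing `session_len` or `max_num`.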

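Training Strategy

Native Multimodal Pre-Training

InternVL3 introduces a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In contrast to the standard paradigm of first training a language-only model and then adapting it to other modalities, this method interleaves multimodal data (such as image-text, video-text, or interleaved image-text sequences) with large-scale text corpora. The model therefore learns linguistic and multimodal representations at the same time and can handle vision-language tasks without separate alignment or bridging modules.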

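Supervised Fine-Tuning

The InternVL3 series adopts the techniques proposed in InternVL2.5: random JPEG compression, square loss re-weighting, and multimodal data packing. Compared with InternVL2.5, the main improvement of the SFT stage in InternVL3 is higher-quality and more diverse training data, with expanded training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.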

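Mixed Preference Optimization

During pre-training and SFT, the model predicts the next token conditioned on previous ground-truth tokens; during inference, it predicts each token conditioned on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift that can impair the model's Chain-of-Thought (CoT) reasoning. To mitigate this, MPO adds supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, which improves reasoning performance. The MPO training objective combines a preference loss $\mathcal{L}_{\text{p}}$, a quality loss $\mathcal{L}_{\text{q}}$, and a generation loss $\mathcal{L}_{\text{g}}$:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where $w_{*}$ denotes the weight assigned to each loss component.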
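As a minimal sketch of the weighted combination above, the function below sums three already-computed loss values. How each component loss is computed is outside the scope of this document, and the weight values in the example call are hypothetical placeholders, not the settings used to train InternVL3.

```python
def mpo_loss(l_p, l_q, l_g, w_p, w_q, w_g):
    """Weighted sum of preference, quality, and generation losses."""
    return w_p * l_p + w_q * l_q + w_g * l_g

# Hypothetical loss values and weights, for illustration only.
total = mpo_loss(l_p=0.6, l_q=0.4, l_g=1.2, w_p=0.8, w_q=0.2, w_g=1.0)
print(round(total, 2))  # 1.76
```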

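Test-Time Scaling

Test-time scaling has been shown to be an effective way to improve the reasoning abilities of LLMs and MLLMs. In this work, the Best-of-N evaluation strategy is used, with [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model that selects the best response for reasoning and mathematics evaluation.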
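Best-of-N selection itself is simple: sample N candidate responses, score each with a critic, and keep the top one. The sketch below assumes the critic is any callable that returns a number; `best_of_n` and the toy scorer are hypothetical stand-ins and do not reproduce how VisualPRM-8B is actually invoked.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest critic score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=score)

# Toy critic for illustration: prefers responses that state a final answer.
def toy_score(response: str) -> float:
    return 1.0 if "answer:" in response.lower() else 0.0

samples = ["I think it is 7.", "Step 1 ... Answer: 12", "Not sure."]
print(best_of_n(samples, toy_score))  # Step 1 ... Answer: 12
```

In practice the candidates would come from repeated sampling with `do_sample=True`, and the scorer would wrap a process reward model.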

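Multimodal Capability Evaluation

Multimodal reasoning and mathematics
OCR, chart, and document understanding
Multi-image and real-world understanding
Comprehensive multimodal and hallucination evaluation
Visual grounding
Multimodal multilingual understanding
Video understanding
GUI grounding
Spatial reasoning

Language Capability Evaluation

InternVL3 is compared with the Qwen2.5 chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series. Note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, because the prompt versions provided in the table were used for OpenCompass evaluation on all datasets.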

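Ablation Studies

Native Multimodal Pre-Training

The experiments use the InternVL2-8B model, keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B follows a training pipeline that starts with an MLP warm-up stage for feature alignment, followed by instruction tuning. In these experiments, the conventional MLP warm-up stage is replaced with native multimodal pre-training. The evaluation results show that the model with native multimodal pre-training performs on par with the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. After instruction tuning on higher-quality data, the model shows further gains across the evaluated multimodal tasks. These results underline how efficiently native multimodal pre-training gives MLLMs strong multimodal capabilities.

Mixed Preference Optimization

Models fine-tuned with MPO show better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of the data used for SFT, which indicates that the gains come primarily from the training algorithm rather than the training data.

Variable Visual Position Encoding

Introducing V2PE leads to significant performance gains on most evaluation metrics. In addition, ablations that vary the positional increment $ \delta $ show that relatively small $ \delta $ values achieve the best performance even on tasks that mainly involve conventional context lengths. These findings offer useful guidance for improving position encoding strategies for visual tokens in MLLMs.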
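The idea of a smaller positional increment for visual tokens can be sketched in a few lines. This is an illustration of the description above, not the model's implementation: the function name, the token-type labels, the convention that the first token sits at position 0, and the value `delta=0.25` are all assumptions.

```python
def v2pe_positions(token_types, delta=0.25):
    """Assign position indices: text tokens advance by 1, visual tokens by delta."""
    positions = []
    current = 0.0
    for i, kind in enumerate(token_types):
        if i > 0:
            current += 1.0 if kind == "text" else delta
        positions.append(current)
    return positions

sequence = ["text", "image", "image", "image", "image", "text"]
print(v2pe_positions(sequence))             # [0.0, 0.25, 0.5, 0.75, 1.0, 2.0]
print(v2pe_positions(sequence, delta=1.0))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

With `delta=1.0` the scheme reduces to ordinary position indexing; with a smaller `delta`, four visual tokens consume the same positional span as a single text token, which is how long visual sequences stay within the position range the model handles well.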

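🔧 Technical Details

Model Architecture Details

Paradigm: Follows the "ViT-MLP-LLM" paradigm, integrating a newly incrementally pre-trained InternViT with various pre-trained LLMs.
Pixel operation: Applies pixel unshuffle to reduce the number of visual tokens.
Resolution strategy: Uses a dynamic resolution strategy that divides images into tiles.
Data support: Supports multi-image and video data since InternVL 2.0.
Position encoding: Integrates Variable Visual Position Encoding (V2PE) to improve long-context understanding.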
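The dynamic resolution step can be illustrated with a condensed, dependency-free restatement of the grid selection performed by `find_closest_aspect_ratio` and `dynamic_preprocess` in the inference script. The names `candidate_grids` and `pick_grid` are hypothetical; the defaults (448-pixel tiles, at most 12 tiles) follow that script.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {(cols, rows)
             for cols in range(1, max_num + 1)
             for rows in range(1, max_num + 1)
             if min_num <= cols * rows <= max_num}
    return sorted(grids, key=lambda g: g[0] * g[1])

def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Choose the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # On a tie, move to the larger grid only if the image has enough pixels.
            if width * height > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best

print(pick_grid(448, 448))    # (1, 1): a small square image stays a single tile
print(pick_grid(1600, 800))   # (4, 2): a large 2:1 image becomes 8 tiles
```

The tie-break explains why two images with the same aspect ratio can end up with different tile counts: larger images are allowed a finer grid.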

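Training Strategy Details

Native Multimodal Pre-Training

Unified learning: Merges language and vision learning into a single pre-training stage.
Data interleaving: Interleaves multimodal data with large-scale text corpora.
No separate alignment: Requires no separate alignment or bridging modules.

Supervised Fine-Tuning

Inherited techniques: Reuses random JPEG compression, square loss re-weighting, and multimodal data packing.
Data upgrade: Uses higher-quality and more diverse training data.

Mixed Preference Optimization

Addressing distribution shift: Adds supervision from positive and negative samples to align the model's response distribution with the ground-truth distribution.
Combined training objective: The training objective combines preference loss, quality loss, and generation loss.

Test-Time Scaling

Evaluation strategy: Uses the Best-of-N evaluation strategy.
Critic model: Uses [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model.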

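📄 License

This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Apache-2.0 License.

Citation

If you find this project useful in your research, please consider citing: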

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}