Open-source InternVL3-78B Multimodal Large Language Model - Excellent in Multi-domain Perceptual Reasoning and Tool Usage

InternVL3-78B-Pretrained

Developed by OpenGVLab

InternVL3-78B is an advanced multimodal large language model developed by OpenGVLab, demonstrating exceptional comprehensive performance. Compared to its predecessor InternVL 2.5, it possesses stronger multimodal perception and reasoning capabilities, extending its abilities to new domains such as tool usage, GUI agents, industrial image analysis, and 3D visual perception.

Image-Text-to-Text

Transformers

Open Source License: Other #Multimodal Large Model #Native Pretraining #Vision-Language Understanding

Downloads 22

Release Time: 4/17/2025

Model Overview

InternVL3-78B-Pretrained is the InternVL3-78B checkpoint that has completed native multimodal pretraining but has not undergone post-training. It adopts the 'ViT-MLP-LLM' architecture, supports multi-image and video input, and offers long-context understanding.

Model Features

Native Multimodal Pretraining

Unified training of language and vision learning to enhance multimodal task processing capabilities

Variable Visual Position Encoding (V2PE)

Adopts smaller and more flexible position increments to improve long-context understanding

Multimodal Capability Expansion

Supports new domains such as tool usage, GUI agents, industrial image analysis, and 3D visual perception

Dynamic Resolution Processing

Divides images into 448×448 pixel tiles, supporting multiple images and video data

Model Capabilities

Multimodal reasoning

Image caption generation

Visual question answering

Document understanding

Video understanding

GUI operation understanding

3D scene understanding

Multilingual support

Use Cases

Intelligent Customer Service

Multimodal Customer Service Assistant

Resolves user issues through image and text interaction

Improves customer service efficiency and user experience

Content Generation

Image-text Content Creation

Generates descriptive or creative text based on images

Automates content production workflows

Industrial Inspection

Defect Analysis

Analyzes industrial images and describes defect conditions

Enhances quality inspection efficiency and accuracy

🚀 InternVL3-78B-Pretrained

InternVL3-78B-Pretrained is a pre-trained multimodal large language model. It has undergone native multimodal pre-training but no post-training. It shows superior overall performance in multimodal perception, reasoning, and text processing.

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

🚀 Quick Start

We provide example code to run InternVL3-78B using transformers.

⚠️ Important Note

Please use transformers>=4.37.2 to ensure the model works normally.

💻 Usage Examples

Basic Usage

# Model Loading - 16-bit (bf16 / fp16)
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-78B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

Advanced Usage

# Model Loading - BNB 8-bit Quantization
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-78B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()
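
As a rough guide to choosing between the 16-bit and 8-bit loading modes above, the weight footprint can be estimated from the parameter count. The sketch below is only an approximation: it assumes a nominal 78 billion parameters (read off the model name, not a measured count) and counts weights only, ignoring activations, the KV cache, and quantization overhead. `weight_memory_gb` is a hypothetical helper, not part of transformers.

```python
# Rough weight-only memory estimate; all figures are nominal assumptions.
NOMINAL_PARAMS = 78_000_000_000  # assumed from the "78B" in the model name


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # bf16 / fp16: 2 bytes per weight
int8_gb = weight_memory_gb(NOMINAL_PARAMS, 1)  # 8-bit: 1 byte per weight
print(f"16-bit weights: ~{bf16_gb:.0f} GB, 8-bit weights: ~{int8_gb:.0f} GB")
```

Under these assumptions the 16-bit weights alone come to roughly 156 GB, so they would not fit on a single 80 GB accelerator.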

# Model Loading - Multiple GPUs
import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    # Build a device map that spreads the LLM layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Place the last decoder layer on GPU 0 as well, next to the output head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-78B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
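
To see how the layer split in `split_model` works out, the arithmetic can be reproduced without loading any model. This is a minimal sketch, assuming an 80-layer language model and 4 GPUs; both numbers are illustrative, and the real depth should be read from `config.llm_config.num_hidden_layers`. `layers_per_gpu` and `assign_layers` are hypothetical helpers, and unlike the listing above the sketch stops assigning once it runs out of layers.

```python
import math


def layers_per_gpu(num_layers: int, world_size: int) -> list[int]:
    """Per-GPU layer budget, treating GPU 0 as half a GPU because it hosts the ViT."""
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    return counts


def assign_layers(num_layers: int, world_size: int) -> dict[int, int]:
    """Map each decoder layer index to a GPU index."""
    mapping = {}
    layer = 0
    for gpu, budget in enumerate(layers_per_gpu(num_layers, world_size)):
        for _ in range(budget):
            if layer >= num_layers:  # budgets are rounded up, so stop at the real depth
                break
            mapping[layer] = gpu
            layer += 1
    # The last layer is moved to GPU 0, next to the output head.
    mapping[num_layers - 1] = 0
    return mapping


counts = layers_per_gpu(80, 4)
mapping = assign_layers(80, 4)
print(counts)
print(mapping[11], mapping[12], mapping[58], mapping[79])
```

With these assumed values, GPU 0 receives about half as many decoder layers as each of the other GPUs, leaving room for the vision encoder.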

# Inference with Transformers
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


✨ Features

This is the pretrained version of InternVL3-78B, which has undergone native multimodal pre-training but has not undergone post-training (i.e., SFT and MPO). If you're unsure which version to use, please use the [InternVL3-78B](https://huggingface.co/OpenGVLab/InternVL3-78B) version.

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefitting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/overall.png)

📚 Documentation

InternVL3 Family

| Model Name | Vision Part | Language Part | HF Link |
| --- | --- | --- | --- |
| InternVL3-1B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-1B) |
| InternVL3-2B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-2B) |
| InternVL3-8B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-8B) |
| InternVL3-9B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-9B) |
| InternVL3-14B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-14B) |
| InternVL3-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-38B) |
| InternVL3-78B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-72B](https://huggingface.co/Qwen/Qwen2.5-72B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL3-78B) |

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/overall-table.png)

Model Architecture

As shown in the following figure, [InternVL3](https://internvl.github.io/blog/2025-04-11-InternVL-3/) retains the same model architecture as [InternVL 2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/) and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/BiiyXN6NOk0p-3rl3ueyL.png)

As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data.
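
The tiling and token reduction described above can be worked through numerically. The following is a minimal sketch, not the model's own preprocessing: it mirrors the grid-selection logic of `find_closest_aspect_ratio` from the Quick Start code and assumes a ViT patch size of 14 pixels, which is not stated in this document. `closest_grid` and `visual_tokens` are hypothetical helper names.

```python
IMAGE_SIZE = 448   # tile edge in pixels, as stated above
PATCH_SIZE = 14    # assumption: ViT patch size, not given in this document
# Patches per tile, reduced to one-quarter by pixel unshuffle.
TOKENS_PER_TILE = (IMAGE_SIZE // PATCH_SIZE) ** 2 // 4


def closest_grid(width, height, max_num=12, image_size=IMAGE_SIZE):
    """Pick the (columns, rows) tile grid whose aspect ratio best matches the image."""
    aspect = width / height
    grids = sorted(
        {(i, j) for i in range(1, max_num + 1) for j in range(1, max_num + 1)
         if i * j <= max_num},
        key=lambda g: g[0] * g[1],
    )
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels.
            best = (cols, rows)
    return best


def visual_tokens(width, height, max_num=12, use_thumbnail=True):
    """Return the grid, the tile count (plus thumbnail), and the visual token count."""
    cols, rows = closest_grid(width, height, max_num)
    tiles = cols * rows
    if use_thumbnail and tiles != 1:
        tiles += 1  # a downscaled copy of the whole image is appended
    return (cols, rows), tiles, tiles * TOKENS_PER_TILE


print(visual_tokens(1920, 1080))
print(visual_tokens(448, 448))
```

Under these assumptions a 1920×1080 image maps to a 4×2 grid plus one thumbnail tile, and each tile contributes 256 visual tokens.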

Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors.
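
The idea of smaller position increments for visual tokens can be illustrated with a short sketch. It assumes that each token after the first advances the position index by 1 if it is textual and by a fraction `delta` if it is visual; the value 0.25 and the helper name `v2pe_positions` are illustrative choices, not values taken from this document.

```python
def v2pe_positions(token_types, delta=0.25):
    """Assign a position index to each token.

    The first token sits at position 0. Every later token advances the index
    by 1.0 if it is textual and by `delta` if it is visual.
    """
    positions = []
    current = 0.0
    for index, kind in enumerate(token_types):
        if index > 0:
            current += delta if kind == "visual" else 1.0
        positions.append(current)
    return positions


sequence = ["text", "visual", "visual", "visual", "visual", "text"]
print(v2pe_positions(sequence))             # fractional steps for visual tokens
print(v2pe_positions(sequence, delta=1.0))  # conventional encoding for comparison
```

With `delta = 0.25`, four visual tokens consume as much of the position range as one text token, which is how long visual sequences can stay within a fixed position window.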

Training Strategy

Native Multimodal Pre-Training

We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage. In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details.

Supervised Fine-Tuning

In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend training samples for tool use, 3D scene understanding, GUI operations, long context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.

Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model’s Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the training objective of MPO is a combination of preference loss $\mathcal{L}_{\text{p}}$, quality loss $\mathcal{L}_{\text{q}}$, and generation loss $\mathcal{L}_{\text{g}}$, which can be formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where $w_{*}$ represents the weight assigned to each loss component. Please see our paper for more details about MPO.
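
As a small numerical illustration of the weighted objective, the sketch below combines three already-computed loss values. The weights and loss values are placeholders chosen for easy arithmetic, not the settings used to train InternVL3, and `mpo_total_loss` is a hypothetical helper.

```python
def mpo_total_loss(l_p, l_q, l_g, w_p, w_q, w_g):
    """Weighted sum of preference, quality, and generation losses."""
    return w_p * l_p + w_q * l_q + w_g * l_g


# Placeholder values for illustration only.
total = mpo_total_loss(l_p=0.5, l_q=0.25, l_g=2.0, w_p=0.5, w_q=0.5, w_g=1.0)
print(total)
```

Setting one weight to zero removes the corresponding term, which makes it easy to check how much each component contributes to the total.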

Test-Time Scaling

Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response for reasoning and mathematics evaluation.
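
Best-of-N selection can be sketched in a few lines: sample N candidate responses, score each with a critic, and keep the highest-scoring one. The example below is a minimal sketch under stated assumptions. `best_of_n` and `toy_critic` are hypothetical names, and `toy_critic` is a stand-in scorer for illustration; it does not reflect how VisualPRM-8B is actually called or how it scores responses.

```python
from typing import Callable, Sequence


def best_of_n(prompt: str,
              candidates: Sequence[str],
              critic: Callable[[str, str], float]) -> str:
    """Return the candidate response that the critic scores highest."""
    if not candidates:
        raise ValueError("at least one candidate response is required")
    return max(candidates, key=lambda response: critic(prompt, response))


def toy_critic(prompt: str, response: str) -> float:
    # Stand-in scorer: rewards responses that state an explicit final answer.
    return 1.0 if "answer:" in response.lower() else 0.0


candidates = [
    "I think it is 12.",
    "Step 1: 3 * 4 = 12. Answer: 12",
    "Not sure.",
]
best = best_of_n("What is 3 * 4?", candidates, toy_critic)
print(best)
```

In practice the candidates would come from sampling the model N times on the same prompt, and the critic would be a learned reward model rather than a string check.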

Evaluation on Multimodal Capability

Multimodal Reasoning and Mathematics ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/reasoning.png)
OCR, Chart, and Document Understanding ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ocr.png)
Multi-Image & Real-World Comprehension ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/multi-images.png)
Comprehensive Multimodal & Hallucination Evaluation ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/comprehensive.png)
Visual Grounding ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/grounding.png)
Multimodal Multilingual Understanding ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/multilingual.png)
Video Understanding ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/video.png)
GUI Grounding ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/gui.png)
Spatial Reasoning ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/vsi.png)

Evaluation on Language Capability

We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefitting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Please note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, as we have adopted the prompt versions provided in the table across all datasets for OpenCompass evaluation.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/text.png)

Ablation Study

Native Multimodal Pre-Training

We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B employs a training pipeline that begins with an MLP warmup phase for feature alignment followed by an Instruction Tuning stage. In our experiments, we substitute the conventional MLP warmup phase with a native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the overall multimodal capability of the model.

The evaluation results in the figure below show that the model with native multimodal pre-training performs comparably to the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, when followed by instruction tuning on higher-quality data, the model demonstrates further performance gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in imparting powerful multimodal capabilities to MLLMs.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-native.png)

Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO demonstrate superior reasoning performance across seven multimodal reasoning benchmarks compared to their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance improvements primarily stem from the training algorithm rather than the training data.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-mpo.png)

Variable Visual Position Encoding

As reported in the table below, the introduction of V2PE leads to significant performance gains across most evaluation metrics. In addition, our ablation studies—by varying the positional increment $ \delta $—reveal that even for tasks primarily involving conventional contexts, relatively small $ \delta $ values can achieve optimal performance. These findings provide important insights for future efforts aimed at refining position encoding strategies for visual tokens in MLLMs.

![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/ablation-v2pe.png)

🔧 Technical Details

The technical details cover model architecture, training strategies, and evaluation methods. The model follows the "ViT-MLP-LLM" paradigm, integrates a newly incrementally pre-trained InternViT with various pre-trained LLMs, and applies techniques such as pixel unshuffle, a dynamic resolution strategy, and Variable Visual Position Encoding (V2PE). The training strategies include Native Multimodal Pre-Training, Supervised Fine-Tuning, Mixed Preference Optimization, and Test-Time Scaling. Evaluation is carried out on both multimodal and language capabilities, and ablation studies are conducted to analyze the effectiveness of different components.

📄 License

The license of this project is [qwen](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE).
