🚀 InternVL3-8B
InternVL3-8B is an advanced multimodal large language model that combines vision and language capabilities, achieving superior performance in various multimodal tasks.
[🐱 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]
[🆕 Blog] [💬 Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]
📚 Documentation
✨ Features
- Advanced Multimodal Capabilities: InternVL3 demonstrates superior multimodal perception and reasoning capabilities, extending to tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.
- Native Multimodal Pre-Training: Consolidates language and vision learning into a single pre-training stage, enhancing the model's ability to handle vision-language tasks without separate alignment or bridging modules.
- Variable Visual Position Encoding (V2PE): Integrates V2PE, which utilizes smaller, more flexible position increments for visual tokens, resulting in better long context understanding capabilities.
📦 Installation
The original README provides no dedicated installation steps. The usage examples below only require a CUDA-enabled PyTorch build and 🤗 Transformers (transformers>=4.37.2, see the note at the end of this card), plus FlashAttention if `use_flash_attn=True` is kept.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL3-8B"

# Load the model in bfloat16 on a single GPU; trust_remote_code is required
# because InternVL ships its own modeling code on the Hub.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

# Matching tokenizer (the slow tokenizer is used for this model family).
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```
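With the model and tokenizer loaded, a minimal text-only conversation can be run through the `chat()` helper exposed by the model's remote code. This is a sketch rather than a prescribed recipe: the generation parameters (`max_new_tokens`, `do_sample`) are illustrative choices, and image inputs are skipped by passing `None` for the pixel values.

```python
# Minimal text-only round trip; passing None as pixel_values skips the vision branch.
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "Hello, who are you?"
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(f"User: {question}\nAssistant: {response}")
```

For image inputs, the 448×448 tiles described in the Technical Details section are preprocessed into a `pixel_values` tensor and passed in place of `None`.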
Advanced Usage
The following example spreads the LLM layers across all available GPUs with a custom `device_map`, while keeping the vision encoder, projector, embeddings, and output head on GPU 0:

```python
import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    # Distribute the LLM layers across all visible GPUs. GPU 0 also hosts the
    # vision encoder, so it is treated as half a GPU when allocating layers.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Pin the vision tower, MLP projector, embeddings, norm, and output head to GPU 0,
    # and keep the last decoder layer there as well so it shares a device with the head.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

path = "OpenGVLab/InternVL3-8B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
```
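As an optional sanity check (not part of the original card), the returned `device_map` can be inspected to see how many decoder layers end up on each GPU index:

```python
from collections import Counter

# Count the LLM layers assigned to each GPU; the exact split depends on torch.cuda.device_count().
layers_per_gpu = Counter(device for name, device in device_map.items()
                         if name.startswith('language_model.model.layers.'))
print(layers_per_gpu)
```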
🔧 Technical Details
Model Architecture
InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate an incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

As in previous versions, we apply a pixel unshuffle operation that reduces the number of visual tokens to one quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is the additional support for multi-image and video data.
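As a back-of-the-envelope illustration of the numbers above, and assuming InternViT's 14×14 patch size (not restated in this card), a single 448×448 tile produces 1,024 ViT patch tokens, which the pixel unshuffle reduces to 256 visual tokens:

```python
# Token budget for one 448x448 tile; the 14-pixel patch size is an assumption about InternViT.
tile_size, patch_size = 448, 14
patches_per_side = tile_size // patch_size        # 32
vit_tokens = patches_per_side ** 2                # 1024 patch tokens out of the ViT
llm_visual_tokens = (patches_per_side // 2) ** 2  # 256 tokens after 0.5x pixel unshuffle (one quarter)
print(vit_tokens, llm_visual_tokens)              # 1024 256
```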
Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors.
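The toy function below is only meant to convey the idea of variable position increments; it is not the model's actual implementation, and the increment value is an arbitrary choice for illustration:

```python
# V2PE-style position assignment (illustrative only): text tokens advance the position
# index by 1, while visual tokens advance it by a smaller fractional increment.
def v2pe_positions(token_types, visual_increment=0.25):
    positions, pos = [], 0.0
    for kind in token_types:                      # each entry is 'text' or 'visual'
        positions.append(pos)
        pos += 1.0 if kind == 'text' else visual_increment
    return positions

print(v2pe_positions(['text', 'visual', 'visual', 'visual', 'visual', 'text']))
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0] -- a long run of visual tokens consumes few positions
```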
Training Strategy
- Native Multimodal Pre-Training: We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage. In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules.
- Supervised Fine-Tuning: In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data.
- Mixed Preference Optimization: During Pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance.
- Test-Time Scaling: Test-Time Scaling has been shown to be an effective method for enhancing the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy with VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation, as sketched below.
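The Best-of-N strategy can be summarized in a few lines; the `score` function below stands in for the VisualPRM-8B critic, whose actual interface is not described in this card and is therefore hypothetical:

```python
# Best-of-N selection: score every candidate response with a critic and keep the best one.
def best_of_n(question, candidate_responses, score):
    return max(candidate_responses, key=lambda response: score(question, response))

# Example with a dummy critic that prefers longer answers (illustrative only).
dummy_score = lambda question, response: len(response)
print(best_of_n("What is 2 + 2?", ["4", "2 + 2 equals 4."], dummy_score))
```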
📄 License
This project is licensed under the Apache-2.0 license.
📋 Information Table
| Property | Details |
|----------|---------|
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
| Base Model | OpenGVLab/InternVL3-8B |
| Base Model Relation | finetune |
| Datasets | OpenGVLab/MMPR-v1.2 |
| Language | multilingual |
| Tags | internvl, unsloth, custom_code |
⚠️ Important Note
Please use transformers>=4.37.2 to ensure the model works as expected.
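As an optional convenience (not part of the original note), the requirement can be checked programmatically; `packaging` is already a dependency of transformers:

```python
# Fail fast if the installed transformers version is older than the required 4.37.2.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.37.2"), (
    f"transformers {transformers.__version__} found; please upgrade to >=4.37.2"
)
```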