🚀 InternVL3-8B
InternVL3-8B is an advanced multimodal large language model that combines vision and language capabilities, achieving superior performance in various multimodal tasks.
[🐱 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]
[🆕 Blog] [💬 Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]
📚 Documentation
✨ Features
- Advanced Multimodal Capabilities: InternVL3 demonstrates superior multimodal perception and reasoning capabilities, extending to tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.
- Native Multimodal Pre-Training: Consolidates language and vision learning into a single pre-training stage, enhancing the model's ability to handle vision-language tasks without separate alignment or bridging modules.
- Variable Visual Position Encoding (V2PE): Integrates V2PE, which utilizes smaller, more flexible position increments for visual tokens, resulting in better long context understanding capabilities.
📦 Installation
The original README provides no dedicated installation steps. The usage examples below only require a CUDA-enabled PyTorch build and 🤗 Transformers (transformers>=4.37.2, see the note at the end of this card), plus FlashAttention if `use_flash_attn=True` is kept.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL3-8B"

# Load the model in bfloat16 on a single GPU; trust_remote_code is required
# because InternVL ships its own modeling code on the Hub.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

# Matching tokenizer (the slow tokenizer is used for this model family).
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```
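With the model and tokenizer loaded, a minimal text-only conversation can be run through the `chat()` helper exposed by the model's remote code. This is a sketch rather than a prescribed recipe: the generation parameters (`max_new_tokens`, `do_sample`) are illustrative choices, and image inputs are skipped by passing `None` for the pixel values.

```python
# Minimal text-only round trip; passing None as pixel_values skips the vision branch.
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "Hello, who are you?"
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(f"User: {question}\nAssistant: {response}")
```

For image inputs, the 448×448 tiles described in the Technical Details section are preprocessed into a `pixel_values` tensor and passed in place of `None`.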
Advanced Usage
The following example spreads the LLM layers across all available GPUs with a custom `device_map`, while keeping the vision encoder, projector, embeddings, and output head on GPU 0:

```python
import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    # Distribute the LLM layers across all visible GPUs. GPU 0 also hosts the
    # vision encoder, so it is treated as half a GPU when allocating layers.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Pin the vision tower, MLP projector, embeddings, norm, and output head to GPU 0,
    # and keep the last decoder layer there as well so it shares a device with the head.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

path = "OpenGVLab/InternVL3-8B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
```
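As an optional sanity check (not part of the original card), the returned `device_map` can be inspected to see how many decoder layers end up on each GPU index:

```python
from collections import Counter

# Count the LLM layers assigned to each GPU; the exact split depends on torch.cuda.device_count().
layers_per_gpu = Counter(device for name, device in device_map.items()
                         if name.startswith('language_model.model.layers.'))
print(layers_per_gpu)
```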
🔧 Technical Details
Model Architecture
InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate an incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

As in previous versions, we apply a pixel unshuffle operation that reduces the number of visual tokens to one quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is the additional support for multi-image and video data.
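As a back-of-the-envelope illustration of the numbers above, and assuming InternViT's 14×14 patch size (not restated in this card), a single 448×448 tile produces 1,024 ViT patch tokens, which the pixel unshuffle reduces to 256 visual tokens:

```python
# Token budget for one 448x448 tile; the 14-pixel patch size is an assumption about InternViT.
tile_size, patch_size = 448, 14
patches_per_side = tile_size // patch_size        # 32
vit_tokens = patches_per_side ** 2                # 1024 patch tokens out of the ViT
llm_visual_tokens = (patches_per_side // 2) ** 2  # 256 tokens after 0.5x pixel unshuffle (one quarter)
print(vit_tokens, llm_visual_tokens)              # 1024 256
```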
Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors.
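The toy function below is only meant to convey the idea of variable position increments; it is not the model's actual implementation, and the increment value is an arbitrary choice for illustration:

```python
# V2PE-style position assignment (illustrative only): text tokens advance the position
# index by 1, while visual tokens advance it by a smaller fractional increment.
def v2pe_positions(token_types, visual_increment=0.25):
    positions, pos = [], 0.0
    for kind in token_types:                      # each entry is 'text' or 'visual'
        positions.append(pos)
        pos += 1.0 if kind == 'text' else visual_increment
    return positions

print(v2pe_positions(['text', 'visual', 'visual', 'visual', 'visual', 'text']))
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0] -- a long run of visual tokens consumes few positions
```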
Training Strategy
- Native Multimodal Pre-Training: We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage. In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules.
- Supervised Fine-Tuning: In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data.
- Mixed Preference Optimization: During Pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance.
- Test-Time Scaling: Test-Time Scaling has been shown to be an effective method for enhancing the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy with VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation, as sketched below.
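The Best-of-N strategy can be summarized in a few lines; the `score` function below stands in for the VisualPRM-8B critic, whose actual interface is not described in this card and is therefore hypothetical:

```python
# Best-of-N selection: score every candidate response with a critic and keep the best one.
def best_of_n(question, candidate_responses, score):
    return max(candidate_responses, key=lambda response: score(question, response))

# Example with a dummy critic that prefers longer answers (illustrative only).
dummy_score = lambda question, response: len(response)
print(best_of_n("What is 2 + 2?", ["4", "2 + 2 equals 4."], dummy_score))
```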
📄 License
This project is licensed under the Apache-2.0 license.
📋 Information Table
| Property | Details |
|----------|---------|
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
| Base Model | OpenGVLab/InternVL3-8B |
| Base Model Relation | finetune |
| Datasets | OpenGVLab/MMPR-v1.2 |
| Language | multilingual |
| Tags | internvl, unsloth, custom_code |
⚠️ Important Note
Please use transformers>=4.37.2 to ensure the model works as expected.
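As an optional convenience (not part of the original note), the requirement can be checked programmatically; `packaging` is already a dependency of transformers:

```python
# Fail fast if the installed transformers version is older than the required 4.37.2.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.37.2"), (
    f"transformers {transformers.__version__} found; please upgrade to >=4.37.2"
)
```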