🚀 InternVL3-38B-Instruct
InternVL3-38B-Instruct is an advanced multimodal large language model that combines vision and language capabilities, offering superior performance in various multimodal tasks.
Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.
[GitHub] [InternVL 1.0] [InternVL 1.5] [InternVL 2.5] [InternVL2.5-MPO] [InternVL3]
[Blog] [Chat Demo] [HF Demo] [Quick Start] [Documents]

🚀 Quick Start
The README does not provide specific quick-start steps. For detailed guidance, refer to the official documentation: [Documents].
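In the meantime, the snippet below is a minimal, unofficial sketch of loading the model with 🤗 Transformers and running a text-only query. It assumes the custom `chat()` interface that InternVL checkpoints expose via `trust_remote_code=True`; see the official documentation for the full image-preprocessing pipeline (dynamic 448×448 tiling) and multi-GPU placement.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint path is an assumption; point it at the repository you actually use.
path = "OpenGVLab/InternVL3-38B-Instruct"

# trust_remote_code is required because InternVL ships custom modeling code.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",  # requires `accelerate`; a 38B model will not fit on a single small GPU
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query via the custom chat() helper; pixel_values=None skips the vision branch.
generation_config = dict(max_new_tokens=512, do_sample=False)
question = "Briefly describe what InternVL3 can do."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```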
✨ Features
- Advanced Multimodal Capabilities: Compared to InternVL 2.5, InternVL3 shows superior multimodal perception and reasoning capabilities, and extends its capabilities to tool usage, GUI agents, industrial image analysis, 3D vision perception, etc.
- Better Language Performance: Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series.
- Flexible Model Architecture: It retains the "ViT-MLP-LLM" paradigm and integrates a newly incrementally pre-trained InternViT with various pre-trained LLMs.
📚 Documentation
Introduction
This is the SFT version of InternVL3-38B, which has undergone native multimodal pre-training and SFT but has not undergone MPO. If you're unsure which version to use, please use the [InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B) version.
We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.

InternVL3 Family
In the following table, we provide an overview of the InternVL3 series.
Model Name | Vision Part | Language Part | HF Link |
---|---|---|---|
InternVL3-1B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | [link](https://huggingface.co/OpenGVLab/InternVL3-1B) |
InternVL3-2B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) | [link](https://huggingface.co/OpenGVLab/InternVL3-2B) |
InternVL3-8B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) | [link](https://huggingface.co/OpenGVLab/InternVL3-8B) |
InternVL3-9B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [link](https://huggingface.co/OpenGVLab/InternVL3-9B) |
InternVL3-14B | [InternViT-300M-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [link](https://huggingface.co/OpenGVLab/InternVL3-14B) |
InternVL3-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [link](https://huggingface.co/OpenGVLab/InternVL3-38B) |
InternVL3-78B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen2.5-72B](https://huggingface.co/Qwen/Qwen2.5-72B) | [link](https://huggingface.co/OpenGVLab/InternVL3-78B) |

Model Architecture
As shown in the following figure, [InternVL3](https://internvl.github.io/blog/2025-04-11-InternVL-3/) retains the same model architecture as [InternVL 2.5](https://internvl.github.io/blog/2024-12-05-InternVL-2.5/) and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.
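To make the paradigm concrete, here is an illustrative (not official) PyTorch sketch of the "ViT-MLP-LLM" composition: a randomly initialized MLP projector maps ViT features into the LLM embedding space, and the projected visual tokens are spliced into the text embedding sequence. Dimensions, layer choices, and names are hypothetical.

```python
import torch
import torch.nn as nn

VIT_HIDDEN, LLM_HIDDEN = 1024, 4096  # illustrative hidden sizes

# Randomly initialized MLP projector bridging vision features to the LLM embedding space.
mlp_projector = nn.Sequential(
    nn.LayerNorm(VIT_HIDDEN),
    nn.Linear(VIT_HIDDEN, LLM_HIDDEN),
    nn.GELU(),
    nn.Linear(LLM_HIDDEN, LLM_HIDDEN),
)

# Stand-ins for the real components: ViT features from one image tile and embedded text tokens.
vision_features = torch.randn(1, 256, VIT_HIDDEN)   # visual tokens from the vision encoder
text_embeds = torch.randn(1, 32, LLM_HIDDEN)        # embedded text prompt tokens

# Project visual features and splice them into the sequence fed to the LLM as inputs_embeds.
visual_embeds = mlp_projector(vision_features)              # (1, 256, LLM_HIDDEN)
llm_inputs = torch.cat([visual_embeds, text_embeds], dim=1)
print(llm_inputs.shape)                                     # torch.Size([1, 288, 4096])
```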

As in the previous version, we apply a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. In addition, we adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is the additional support for multi-image and video data.
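For intuition, the sketch below shows how a pixel-unshuffle (space-to-depth) step with a 2×2 window trades spatial resolution for channel depth, cutting the visual token count to one quarter; the tensor shapes are illustrative, not the exact implementation.

```python
import torch

def pixel_unshuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """(B, H, W, C) -> (B, H/scale, W/scale, C*scale*scale); token count drops by scale**2."""
    b, h, w, c = x.shape
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, h // scale, w // scale, c * scale * scale)

feats = torch.randn(1, 32, 32, 1024)   # 32*32 = 1024 visual tokens from one 448x448 tile
reduced = pixel_unshuffle(feats)        # -> (1, 16, 16, 4096): 256 tokens, 1/4 of the original
print(reduced.shape)
```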
Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors.
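As a rough illustration of the idea (not the exact V2PE formulation), the sketch below assigns full-step position increments to text tokens and a smaller, configurable increment to visual tokens, so long multimodal sequences consume the positional range more slowly; the function and step size are hypothetical.

```python
from typing import List

def v2pe_style_positions(is_visual: List[bool], visual_step: float = 0.25) -> List[float]:
    """Assign position ids: text tokens advance by 1, visual tokens by a smaller increment."""
    positions, pos = [], 0.0
    for visual in is_visual:
        positions.append(pos)
        pos += visual_step if visual else 1.0
    return positions

# 4 text tokens, then 8 visual tokens, then 2 more text tokens.
mask = [False] * 4 + [True] * 8 + [False] * 2
print(v2pe_style_positions(mask))
```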
Training Strategy
Native Multimodal Pre-Training
We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage. In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details.
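A minimal sketch of the interleaving idea, assuming two iterable datasets (names and mixing ratio are hypothetical): samples are drawn from multimodal and text-only corpora according to a mixing probability, so both kinds of supervision reach the model within the same pre-training stage.

```python
import random
from itertools import cycle

def interleaved_samples(multimodal_ds, text_ds, multimodal_ratio=0.5, seed=0):
    """Yield a stream that mixes multimodal samples (image/video-text) with pure text."""
    rng = random.Random(seed)
    mm_iter, txt_iter = cycle(multimodal_ds), cycle(text_ds)
    while True:
        if rng.random() < multimodal_ratio:
            yield next(mm_iter)   # e.g. an interleaved image-text sequence
        else:
            yield next(txt_iter)  # e.g. a large-scale text-only corpus sample

# Toy usage with placeholder data.
mm = [{"type": "image-text", "id": i} for i in range(3)]
txt = [{"type": "text", "id": i} for i in range(3)]
stream = interleaved_samples(mm, txt)
print([next(stream)["type"] for _ in range(6)])
```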
Supervised Fine-Tuning
In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend training samples for tool use, 3D scene understanding, GUI operations, long context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.
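As an example of one of these techniques, the sketch below applies random JPEG compression as an image augmentation using Pillow; the probability and quality range are illustrative placeholders, not the values used in training.

```python
import io
import random
from PIL import Image

def random_jpeg_compression(img: Image.Image, p: float = 0.5,
                            quality_range=(30, 95), seed=None) -> Image.Image:
    """With probability p, re-encode the image as JPEG at a randomly chosen quality level."""
    rng = random.Random(seed)
    if rng.random() >= p:
        return img
    quality = rng.randint(*quality_range)
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage: augmented = random_jpeg_compression(Image.open("example.jpg"))
```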
Mixed Preference Optimization
During pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the training objective of MPO is a combination of preference loss \(\mathcal{L}_{\text{p}}\), quality loss \(\mathcal{L}_{\text{q}}\), and generation loss \(\mathcal{L}_{\text{g}}\), which can be formulated as follows:
$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$
where \(w_{*}\) represents the weight assigned to each loss component. Please see our paper for more details about MPO.
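A minimal sketch of the weighted combination, assuming the component losses are already computed as tensors. The DPO-style preference term and the weights are illustrative stand-ins, not the paper's exact formulation; see the MPO paper for the actual components.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style preference loss over (chosen, rejected) response pairs (illustrative)."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()

def mpo_objective(loss_p, loss_q, loss_g, w_p=1.0, w_q=1.0, w_g=1.0):
    """L = w_p * L_p + w_q * L_q + w_g * L_g, matching the formula above (weights are placeholders)."""
    return w_p * loss_p + w_q * loss_q + w_g * loss_g

# Toy usage with placeholder scalar losses.
loss = mpo_objective(torch.tensor(0.42), torch.tensor(0.10), torch.tensor(1.30))
print(loss.item())
```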
Test-Time Scaling
Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response for reasoning and mathematics evaluation.
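The sketch below illustrates the Best-of-N idea under simplified assumptions: `generate` and `score` are hypothetical callables standing in for sampling from the policy model and scoring with a critic such as VisualPRM-8B, not the actual evaluation harness.

```python
from typing import Callable, List

def best_of_n(question: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate responses and return the one the critic scores highest."""
    candidates: List[str] = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda resp: score(question, resp))

# Usage (placeholders): best = best_of_n(q, generate=policy_sample, score=critic_score, n=8)
```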
Evaluation on Multimodal Capability
Multimodal Reasoning and Mathematics

OCR, Chart, and Document Understanding

Multi - Image & Real - World Comprehension

Comprehensive Multimodal & Hallucination Evaluation

Visual Grounding

Multimodal Multilingual Understanding

Video Understanding

GUI Grounding

Spatial Reasoning

Evaluation on Language Capability
We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Please note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, as we have adopted the prompt versions provided in the table across all datasets for OpenCompass evaluation.

Ablation Study
Native Multimodal Pre - Training
We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B employs a training pipeline that begins with an MLP warmup phase for feature alignment, followed by an Instruction Tuning stage. In our experiments, we substitute the conventional MLP warmup phase with a native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the overall multimodal capability of the model.
The evaluation results in the figure below show that the model with native multimodal pre-training performs comparably to the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, when followed by instruction tuning on higher-quality data, the model demonstrates further performance gains across the evaluated multimodal tasks.
🔧 Technical Details
Model Information
Property | Details |
---|---|
Model Type | Image-text-to-text |
Library Name | transformers |
Base Model | OpenGVLab/InternVL3-38B-Instruct |
Base Model Relation | finetune |
Language | multilingual |
Tags | internvl, unsloth, custom_code |
License
The model is licensed under the [Apache-2.0](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) license.