InternVL3-38B Transformers 🤗 Implementation
This repository provides a Hugging Face 🤗 Transformers implementation of the InternVL3-38B model, enabling efficient multimodal processing with various input types.
| Property | Details |
|---|---|
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
| Base Model | OpenGVLab/InternVL3-38B-Instruct |
| Base Model Relation | finetune |
| Datasets | OpenGVLab/MMPR-v1.2 |
| Language | multilingual |
| Tags | internvl |
[InternVL 1.0] [InternVL 1.5] [InternVL 2.5] [InternVL2.5-MPO] [InternVL3]
[Blog] [Chat Demo] [🤗 HF Demo] [Quick Start] [Documents]
Important Note
This repository contains the Hugging Face 🤗 Transformers implementation of the OpenGVLab/InternVL3-38B model.
It is intended to be functionally equivalent to the original OpenGVLab release.
As a native Transformers model, it supports core library features such as different attention implementations (eager, SDPA, and FlashAttention-2) and enables efficient batched inference with interleaved image, video, and text inputs.
Quick Start
This implementation aims to be functionally equivalent to the original OpenGVLab release and supports core library features such as different attention implementations and efficient batched inference with interleaved image, video, and text inputs.
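For example, the attention backend can be selected when loading the model through the standard attn_implementation argument of from_pretrained. The following is a minimal sketch using the OpenGVLab/InternVL3-38B-hf checkpoint from the examples below; "sdpa" is shown here, while "eager" and "flash_attention_2" (which requires the flash-attn package) are the alternatives.
>>> import torch
>>> from transformers import AutoModelForImageTextToText

>>> # Choose the attention backend at load time: "eager", "sdpa", or "flash_attention_2"
>>> model = AutoModelForImageTextToText.from_pretrained(
...     "OpenGVLab/InternVL3-38B-hf",
...     torch_dtype=torch.bfloat16,
...     device_map="auto",
...     attn_implementation="sdpa",
... )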
Features
- Advanced Multimodal Performance: InternVL3 demonstrates superior overall performance compared to previous versions, with enhanced multimodal perception and reasoning capabilities.
- Extended Multimodal Capabilities: It extends its multimodal capabilities to tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.
- Better Text Performance: Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series.
- Diverse Input Support: Supports various input types, including single images, text-only prompts, batched images and text, batched multi-image inputs, video, and interleaved image and video inputs (a text-only sketch follows this list).
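Since the model also handles plain text conversations, a text-only prompt can be passed through the same chat template. The following is a minimal sketch under the same loading setup as the examples below; the prompt text is only illustrative.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> model_checkpoint = "OpenGVLab/InternVL3-38B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map="cuda", torch_dtype=torch.bfloat16)

>>> # Text-only conversation: the content list contains no image or video entries
>>> messages = [
...     {"role": "user", "content": [{"type": "text", "text": "Explain in one sentence what a vision-language model is."}]},
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))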
Usage Examples
Basic Usage
Here are several basic usage examples demonstrating how to use the InternVL3
models for different input types:
Inference with Pipeline
>>> from transformers import pipeline
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-38B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
Inference on a single image
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-38B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Please describe the image explicitly."},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
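For interactive use, generated tokens can also be printed as they are produced. The following is a minimal sketch that reuses the model, processor, and inputs from the example above and assumes the processor exposes its tokenizer as processor.tokenizer; TextStreamer is a standard 🤗 Transformers utility.
>>> from transformers import TextStreamer

>>> # Stream decoded tokens to stdout as they are generated, skipping the prompt tokens
>>> streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
>>> _ = model.generate(**inputs, max_new_tokens=50, streamer=streamer)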
Advanced Usage
The following examples show more advanced usage scenarios, such as handling batched inputs and video inputs:
Batched image and text inputs
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-38B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
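Several images can also be interleaved within a single conversation by listing multiple image entries in one content list. The following is a minimal sketch reusing the model, processor, and image URLs from the batched example above; the comparative question is only illustrative.
>>> # One conversation containing two images followed by a comparative question
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...             {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...             {"type": "text", "text": "What are the differences between these two images?"},
...         ],
...     },
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> print(processor.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))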
Video input
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> model_checkpoint = "OpenGVLab/InternVL3-38B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "video",
... "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
... },
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... return_tensors="pt",
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
... ).to(model.device, dtype=torch.float16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
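Image and video conversations can also be processed together in a single batch. The following is a minimal sketch reusing the quantized model and processor from the video example above and URLs already used in this card; the prompts are only illustrative.
>>> # One batch containing a video conversation and an image conversation
>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
...                 {"type": "text", "text": "What sport is shown in this video?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...                 {"type": "text", "text": "How many animals are in this image?"},
...             ],
...         },
...     ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.float16)
>>> outputs = model.generate(**inputs, max_new_tokens=25)
>>> print(processor.batch_decode(outputs, skip_special_tokens=True))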
License
This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License. You can find the detailed license information here.
Documentation
More detailed information about the InternVL3 family is available in the original checkpoint OpenGVLab/InternVL3-38B and in the official documentation.
Citation
If you find this project useful in your research, please consider citing:
@article{chen2024expanding,
title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
journal={arXiv preprint arXiv:2412.05271},
year={2024}
}
@article{wang2024mpo,
title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2411.10442},
year={2024}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}