InternVL3-14B-Instruct-GGUF: an open-source multimodal model for industrial image analysis, 3D perception, and other tasks

Internvl3 14B Instruct GGUF

Developed by unsloth

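InternVL3-14B-Instruct is an advanced multimodal large language model (MLLM) with strong multimodal perception and reasoning capabilities. It supports tool use, GUI agents, industrial image analysis, 3D vision perception, and other tasks.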

Image-to-Text

Transformers

License: Apache-2.0 #multimodal-reasoning #native-pre-training #long-context-understanding

Downloads: 982

Release date: 5/19/2025

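Model Overview

InternVL3-14B-Instruct is a multimodal large language model fine-tuned from the Qwen2.5-14B language model. It combines image understanding with text generation and is suited to complex multimodal tasks.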

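Model Features

Native multimodal pre-training

Consolidates language and vision learning into a single pre-training stage, strengthening multimodal representations.

Variable Visual Position Encoding (V2PE)

Uses smaller, more flexible position increments for visual tokens, improving long-context understanding.

Mixed Preference Optimization (MPO)

Aligns the model's response distribution using supervision from positive and negative samples, improving reasoning performance.

Dynamic resolution support

Accepts multi-image and video inputs and adapts to visual tasks at different resolutions.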

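Model Capabilities

Image understanding

Text generation

Multimodal reasoning

Tool use

GUI agents

3D vision perception

Video understanding

OCR and document analysis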

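Use Cases

Industrial applications

Industrial image analysis

Detects and analyzes image data from industrial scenes.

Improves inspection accuracy and efficiency.

Education

Multimodal teaching assistant

Generates teaching content by combining images and text.

Provides a more intuitive learning experience.

Creative work

Creative writing

Generates poems or stories from images.

Sparks creative ideas.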

🚀 InternVL3-14B-Instruct

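InternVL3-14B-Instruct is an advanced multimodal large language model that performs well in multimodal perception, reasoning, and language processing, and broadens the range of applications for multimodal capabilities.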

🚀 Quick Start

We provide example code for running InternVL3-14B with the transformers library.

⚠️ Important

Use transformers>=4.37.2 to ensure the model works correctly.
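The version requirement above can be checked before loading the model. The following is a minimal, stdlib-only sketch; the helper names `parse_version` and `meets_minimum` are hypothetical, and the parser only looks at leading numeric components, so suffixes such as `dev0` are ignored.

```python
import re

def parse_version(version: str) -> tuple:
    """Turn '4.37.2' (or '4.38.0.dev0') into a tuple of leading integers."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if not match:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed: str, minimum: str = '4.37.2') -> bool:
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum('4.37.2'))  # True
print(meets_minimum('4.36.0'))  # False
```

In practice the installed version string can be read with `importlib.metadata.version('transformers')` and passed to `meets_minimum`.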

Model Loading

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-14B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-14B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

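Multi-GPU usage

The code below is written to avoid errors caused by tensors ending up on different devices during multi-GPU inference. It places the first and last layers of the large language model (LLM) on the same device, which prevents such errors.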

import math
import torch
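from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    # Build the Hugging Face repo id from the short model name, e.g. 'OpenGVLab/InternVL3-14B'.
    model_path = f'OpenGVLab/{model_name}'
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)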
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-14B"
device_map = split_model('InternVL3-14B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
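To see how the `split_model` logic above distributes layers without needing any GPUs, the sketch below reproduces only its arithmetic in plain Python. The helper name `plan_layers` is hypothetical, the value of 48 layers is illustrative rather than read from the model config, and unlike the original loop it skips slots beyond the last real layer.

```python
import math
from collections import Counter

def plan_layers(num_layers: int, world_size: int) -> dict:
    """Map each LLM layer index to a GPU index, treating GPU 0 as half a GPU."""
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    budget = [per_gpu] * world_size
    budget[0] = math.ceil(budget[0] * 0.5)  # GPU 0 also hosts the ViT
    plan, layer = {}, 0
    for gpu, count in enumerate(budget):
        for _ in range(count):
            if layer < num_layers:  # skip slots beyond the last real layer
                plan[layer] = gpu
                layer += 1
    plan[num_layers - 1] = 0  # last layer sits with the embeddings and output head
    return plan

plan = plan_layers(num_layers=48, world_size=2)  # 48 layers is an illustrative value
print(Counter(plan.values()))  # Counter({1: 31, 0: 17})
```

With two GPUs, GPU 0 receives 16 early layers plus the final layer, and GPU 1 receives the remaining 31, which leaves room on GPU 0 for the vision encoder.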

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

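def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    # Build the Hugging Face repo id from the short model name, e.g. 'OpenGVLab/InternVL3-14B'.
    model_path = f'OpenGVLab/{model_name}'
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)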
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
path = 'OpenGVLab/InternVL3-14B'
device_map = split_model('InternVL3-14B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
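The frame sampling done by `get_index` in the example above can be tried without a video file. The sketch below mirrors its arithmetic in plain Python; the function name `sample_frame_indices` is hypothetical, and it uses the built-in `round`, which rounds halves to the nearest even integer just as `np.round` does.

```python
def sample_frame_indices(max_frame, fps, num_segments, bound=None, first_idx=0):
    """Pick one frame index near the middle of each of `num_segments` equal segments."""
    start, end = bound if bound else (-100000, 100000)  # seconds; None means whole video
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = (end_idx - start_idx) / num_segments
    return [int(start_idx + seg_size / 2 + round(seg_size * idx))
            for idx in range(num_segments)]

# A 100-frame clip at 25 fps, sampled into 4 segments.
print(sample_frame_indices(max_frame=99, fps=25.0, num_segments=4))  # [12, 37, 62, 86]
```

Passing `bound=(1, 3)` restricts sampling to the frames between second 1 and second 3 of the clip.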

Streaming output

Besides the method above, you can use the following code to stream the output.

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

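✨ Key Features

Advanced multimodal capabilities: Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and other areas.
Native multimodal pre-training: A native multimodal pre-training approach consolidates language and vision learning into a single pre-training stage, so the model learns linguistic and multimodal representations at the same time and handles vision-language tasks better.
Better long-context understanding: Variable Visual Position Encoding (V2PE) uses smaller, more flexible position increments for visual tokens, which improves InternVL3's long-context understanding.
Text performance beyond Qwen2.5: Thanks to native multimodal pre-training, the InternVL3 series outperforms the Qwen2.5 series in overall text performance.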

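📚 Documentation

Model introduction

This is the SFT version of InternVL3-14B. It has gone through native multimodal pre-training and SFT, but not MPO. If you are unsure which version to use, use InternVL3-14B.

InternVL3 is a series of advanced multimodal large language models (MLLMs) with strong overall performance. Compared with InternVL 2.5, InternVL3 shows better multimodal perception and reasoning, and further extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and other areas.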

InternVL3 family

The following table gives an overview of the InternVL3 series:

Model name	Vision part	Language part	Hugging Face link
InternVL3-1B	InternViT-300M-448px-V2_5	Qwen2.5-0.5B	link
InternVL3-2B	InternViT-300M-448px-V2_5	Qwen2.5-1.5B	link
InternVL3-8B	InternViT-300M-448px-V2_5	Qwen2.5-7B	link
InternVL3-9B	InternViT-300M-448px-V2_5	internlm3-8b-instruct	link
InternVL3-14B	InternViT-300M-448px-V2_5	Qwen2.5-14B	link
InternVL3-38B	InternViT-6B-448px-V2_5	Qwen2.5-32B	link
InternVL3-78B	InternViT-6B-448px-V2_5	Qwen2.5-72B	link

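Model architecture

As shown in the figure below, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

As in previous versions, we apply a pixel unshuffle operation that reduces the number of visual tokens to one quarter of the original. We also adopt a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally support multi-image and video data.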
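The token arithmetic in the paragraph above can be checked with a few lines of code. This is a rough sketch: the 448-pixel tile size and the one-quarter reduction come from the text, the thumbnail rule follows the `dynamic_preprocess` helper in the inference example, and the ViT patch size of 14 is an assumption that is not stated in this document. The helper names are hypothetical.

```python
def visual_tokens_per_tile(tile_size=448, patch_size=14, shuffle_ratio=0.5):
    """Tokens left for one tile after patching and pixel unshuffle."""
    patches_per_side = tile_size // patch_size   # 448 // 14 = 32 (patch size is assumed)
    patches = patches_per_side ** 2              # 32 * 32 = 1024
    return int(patches * shuffle_ratio ** 2)     # 1024 * 0.25 = 256

def visual_tokens(num_tiles, use_thumbnail=True):
    """Total visual tokens for an image split into `num_tiles` tiles."""
    # A thumbnail tile is appended only when the image was split into more than one tile.
    tiles = num_tiles + (1 if use_thumbnail and num_tiles > 1 else 0)
    return tiles * visual_tokens_per_tile()

print(visual_tokens_per_tile())  # 256
print(visual_tokens(12))         # 13 tiles * 256 = 3328
```

Under these assumptions, an image that uses the maximum of 12 tiles plus a thumbnail contributes a few thousand visual tokens, which is why long-context handling matters for multi-image and video inputs.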

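Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 shows better long-context understanding than its predecessors.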
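One way to picture the idea of smaller position increments for visual tokens is the toy sketch below. It is an illustration under stated assumptions, not the model's implementation: text tokens are assumed to advance the position by 1, visual tokens by a fraction `delta`, and both the function name `v2pe_positions` and the value `delta=0.25` are hypothetical.

```python
def v2pe_positions(token_types, delta=0.25):
    """Assign a position to each token: text advances by 1, visual by `delta`."""
    positions, current = [], 0.0
    for i, kind in enumerate(token_types):
        if i > 0:
            current += 1.0 if kind == 'text' else delta
        positions.append(current)
    return positions

tokens = ['text', 'image', 'image', 'image', 'image', 'text']
print(v2pe_positions(tokens))             # [0.0, 0.25, 0.5, 0.75, 1.0, 2.0]
print(v2pe_positions(tokens, delta=1.0))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

With `delta=0.25` the six tokens span positions 0 to 2 instead of 0 to 5, so a long run of visual tokens consumes much less of the position range.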

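Training strategy

Native multimodal pre-training

We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In contrast to the standard paradigm, which first trains a language-only model and then adapts it to handle additional modalities, our method interleaves multimodal data (such as image-text, video-text, or image-text interleaved sequences) with large-scale text corpora. This unified training scheme lets the model learn linguistic and multimodal representations simultaneously, which strengthens its ability to handle vision-language tasks without separate alignment or bridging modules. See our paper for more details.

Supervised fine-tuning

In this stage, the techniques proposed in InternVL2.5, namely random JPEG compression, square loss re-weighting, and multimodal data packing, are also applied to the InternVL3 series. The main advance of the SFT stage in InternVL3 over InternVL2.5 is the use of higher-quality and more diverse training data. Specifically, we further extend the training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.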

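Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on the previous ground-truth tokens. During inference, however, the model predicts each token based on its own previous outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's chain-of-thought (CoT) reasoning. To mitigate this, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the training objective of MPO is a combination of the preference loss $\mathcal{L}_{\text{p}}$, the quality loss $\mathcal{L}_{\text{q}}$, and the generation loss $\mathcal{L}_{\text{g}}$, formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where $w_{*}$ denotes the weight assigned to each loss component. See our paper for more details about MPO.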
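The combined objective is a plain weighted sum, which the short sketch below spells out. The loss values and weights are made-up numbers for illustration only; the document does not state the weights that were actually used, and the function name `mpo_loss` is hypothetical.

```python
def mpo_loss(loss_p, loss_q, loss_g, w_p=1.0, w_q=1.0, w_g=1.0):
    """Weighted sum of the preference, quality, and generation losses."""
    return w_p * loss_p + w_q * loss_q + w_g * loss_g

# Illustrative values only: 0.5 * 0.5 + 0.5 * 0.25 + 1.0 * 2.0
print(mpo_loss(0.5, 0.25, 2.0, w_p=0.5, w_q=0.5, w_g=1.0))  # 2.375
```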

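Test-time scaling

Test-time scaling has been shown to be an effective way to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ [VisualPRM-8B](https://huggingface.co/OpenGVLab/VisualPRM-8B) as the critic model to select the best response for reasoning and mathematics evaluation.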
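Best-of-N selection itself is simple: generate several candidate responses, score each with a critic, and keep the highest-scoring one. The sketch below shows only that selection step. The scoring table is a stand-in with made-up numbers, not output from VisualPRM-8B, and the function name `best_of_n` is hypothetical.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest critic score."""
    if not candidates:
        raise ValueError('need at least one candidate response')
    return max(candidates, key=score)

# Stand-in critic: a lookup table of made-up scores instead of a real reward model.
toy_scores = {'x = 3': 0.2, 'x = 4': 0.9, 'x = 5': 0.4}
print(best_of_n(list(toy_scores), toy_scores.get))  # x = 4
```

In a real setup the `score` callable would wrap a call to the critic model, and the candidates would come from sampling the MLLM N times on the same question.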

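Multimodal capability evaluation

Multimodal reasoning and mathematics
OCR, chart, and document understanding
Multi-image and real-world understanding
Comprehensive multimodal and hallucination evaluation
Visual grounding
Multimodal multilingual understanding
Video understanding
GUI grounding
Spatial reasoning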

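Language capability evaluation

We compare InternVL3 with the Qwen2.5 chat models, whose corresponding pre-trained base models were used to initialize the language component of InternVL3. Thanks to native multimodal pre-training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Note that the evaluation scores for the Qwen2.5 series may differ from the officially reported ones, because we used the prompt versions listed in the table for all datasets in the OpenCompass evaluation.

(Image: language capability evaluation results)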

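Ablation study

Native multimodal pre-training

We ran experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B uses a training pipeline that starts with an MLP warmup stage for feature alignment, followed by an instruction tuning stage. In our experiments, we replaced the conventional MLP warmup stage with the native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the model's overall multimodal capability.

The evaluation results in the figure below show that the model with native multimodal pre-training performs on par with the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, when followed by instruction tuning on higher-quality data, the model shows further performance gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in giving MLLMs strong multimodal capabilities.

(Image: ablation study of native multimodal pre-training)

Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO show better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, which indicates that the performance gains come mainly from the training algorithm rather than the training data.

(Image: ablation study of Mixed Preference Optimization)

Variable Visual Position Encoding

As shown in the table below, introducing V2PE yields significant performance gains on most evaluation metrics. In addition, our ablation study, which varies the position increment $\delta$, shows that relatively small $\delta$ values can achieve the best performance even on tasks that mainly involve conventional contexts. These findings offer important insights for future work on position encoding strategies for visual tokens in MLLMs.

(Image: ablation study of Variable Visual Position Encoding)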

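🔧 Technical Details

Model loading

We provide several ways to load the model for different hardware environments: 16-bit (bf16 / fp16) loading, BNB 8-bit quantized loading, and multi-GPU loading. Choosing the loading method that matches the available hardware makes better use of resources and improves runtime efficiency.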
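A back-of-the-envelope estimate can help when choosing between these loading modes. The sketch below counts weight memory only, using the nominal 14 billion parameters implied by the model name; the real parameter count, activations, and KV cache are not included, so treat the result as a lower bound. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 14e9  # nominal size taken from the "14B" in the model name
print(weight_memory_gb(NUM_PARAMS, 16))  # bf16 / fp16: 28.0
print(weight_memory_gb(NUM_PARAMS, 8))   # 8-bit:       14.0
```

If the estimate for a given precision exceeds the memory of a single GPU, 8-bit quantization or the multi-GPU device map are the options described above.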

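Inference process

Inference relies on a set of image processing and preprocessing steps: building the image transform, finding the closest aspect ratio, and dynamically preprocessing the image into tiles. These steps let the model handle different input types, including single images, multiple images, and video.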
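The aspect-ratio step can be explored without loading any image. The sketch below condenses the grid selection performed by `find_closest_aspect_ratio` and `dynamic_preprocess` in the inference example into one pure-Python function. The name `choose_grid` is hypothetical, and candidates with the same tile count are sorted in a fixed order here so that the result is deterministic.

```python
def choose_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return (columns, rows) of the tile grid whose aspect ratio best fits the image."""
    aspect = width / height
    grids = sorted(
        {(i, j) for i in range(1, max_num + 1) for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: (g[0] * g[1], g))  # fewest tiles first, then a fixed order
    best, best_diff = (1, 1), float('inf')
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            best = (cols, rows)  # on a tie, a large enough image gets the larger grid
    return best

print(choose_grid(800, 600))    # (4, 3): 12 tiles of 448x448
print(choose_grid(448, 448))    # (1, 1)
print(choose_grid(1000, 1000))  # (3, 3)
```

The image is then resized to `cols * 448` by `rows * 448` pixels and cut into that many tiles, with a thumbnail tile added when more than one tile is produced.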

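Streaming output

Streaming output is implemented with TextIteratorStreamer and a background thread. Generated text is displayed as it is produced, which improves the user experience.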
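The producer and consumer pattern behind this can be shown with the standard library alone. In the sketch below a worker thread stands in for the model's generation call and pushes text pieces into a queue, while the main thread prints them as they arrive. The functions `generate_chunks` and `stream` are hypothetical stand-ins, not part of transformers.

```python
import queue
import threading

def generate_chunks(out: queue.Queue) -> None:
    """Stand-in for a model's generate call: emit text pieces, then a stop marker."""
    for piece in ['The ', 'red ', 'panda ', 'is ', 'eating.']:
        out.put(piece)
    out.put(None)  # sentinel: generation finished

def stream(out: queue.Queue, timeout: float = 10.0):
    """Yield pieces as they arrive, stopping at the sentinel."""
    while True:
        piece = out.get(timeout=timeout)
        if piece is None:
            return
        yield piece

chunks = queue.Queue()
worker = threading.Thread(target=generate_chunks, args=(chunks,))
worker.start()

generated_text = ''
for new_text in stream(chunks):
    generated_text += new_text
    print(new_text, end='', flush=True)  # show each piece immediately
worker.join()
print()
```

The timeout on the queue read plays the same role as the `timeout=10` argument in the streaming example: the consumer does not block forever if the producer stalls.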

💻 Usage Examples

Basic usage

# Model loading and inference example
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-14B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# load_image is the helper defined in the Transformers inference example earlier in this document
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

Advanced usage

# Multi-image multi-round conversation example
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
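When images are passed separately, the example above suggests three things should line up: the number of `<image>` placeholders in the question, the length of `num_patches_list`, and the total number of stacked tiles in `pixel_values`. The sketch below is a hypothetical pre-flight check built on that reading of the example; `check_image_inputs` is not part of the model's API.

```python
def check_image_inputs(question: str, num_patches_list: list, total_tiles: int) -> None:
    """Raise ValueError when placeholders, tile counts, and stacked tiles disagree."""
    placeholders = question.count('<image>')
    if placeholders != len(num_patches_list):
        raise ValueError(
            f'{placeholders} <image> placeholders but {len(num_patches_list)} tile counts')
    if sum(num_patches_list) != total_tiles:
        raise ValueError(
            f'tile counts sum to {sum(num_patches_list)} but {total_tiles} tiles were stacked')

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
# Example numbers: 13 tiles for the first image, 7 for the second, 20 stacked in total.
check_image_inputs(question, num_patches_list=[13, 7], total_tiles=20)  # passes silently
```

In the real script, `total_tiles` would be `pixel_values.size(0)`.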

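📦 Installation

LMDeploy installation

# If lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
# Quote the requirement so the shell does not treat >= as an output redirection.
pip install "lmdeploy>=0.7.3"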

OpenAI library installation

pip install openai
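Once the openai package is installed, a request to an OpenAI-compatible server carries the image inside the chat message. The sketch below only builds such a message and does not contact any server. It assumes the server accepts `image_url` content parts in the OpenAI chat format; the server address and model name are not given in this document, and `build_image_message` is a hypothetical helper.

```python
import base64
import json

def build_image_message(prompt: str, image_bytes: bytes, mime: str = 'image/jpeg') -> dict:
    """Build one user message with text plus an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode('ascii')
    return {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': prompt},
            {'type': 'image_url', 'image_url': {'url': f'data:{mime};base64,{encoded}'}},
        ],
    }

# Placeholder bytes stand in for a real image file read with open(path, 'rb').read().
message = build_image_message('Describe the image.', b'\xff\xd8\xff')
print(json.dumps(message, indent=2))
```

The resulting dictionary can be passed in the `messages` list of a chat completion request made with the openai client.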

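📄 License

This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Apache 2.0 License.

Citation

If you find this project useful in your research, please consider citing: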

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}