🚀 InternVL3-1B-Instruct
InternVL3-1B-Instruct is the instruction-tuned version in the InternVL3 series, an advanced multimodal large language model (MLLM) with strong multimodal perception and reasoning capabilities that further extends to areas such as tool use and GUI agents.
[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]
[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

✨ Key Features
- Outstanding multimodal capabilities: compared with InternVL 2.5, InternVL3 delivers stronger multimodal perception and reasoning, and further extends to tool use, GUI agents, industrial image analysis, 3D vision perception, and more.
- Strong text performance: thanks to native multimodal pre-training, the InternVL3 series outperforms the Qwen2.5 series in overall text performance.
- Strong long-context understanding: Variable Visual Position Encoding (V2PE) is integrated, using smaller and more flexible position increments for visual tokens to improve long-context understanding.
📦 Installation
Run InternVL3-1B with the transformers library
# Please use transformers>=4.37.2 to ensure the model works properly
pip install "transformers>=4.37.2"
Install lmdeploy for deployment
# For lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
pip install "lmdeploy>=0.7.3"
Install openai to use the OpenAI-style interface
pip install openai
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
Advanced Usage
Multi-GPU Inference
import math
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModel

def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will also host the ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

path = "OpenGVLab/InternVL3-1B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
Inference Example
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # generate the set of candidate aspect ratios
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the aspect ratio closest to the original image
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # compute the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image into tiles
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will also host the ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
path = 'OpenGVLab/InternVL3-1B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the maximum number of tiles via `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')
# multi-round conversation over a video
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
Streaming Output
from transformers import TextIteratorStreamer
from threading import Thread

# initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# run model.chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# initialize an empty string to collect the generated text
generated_text = ''
# iterate over the streamer to fetch the newly generated text
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # print each newly generated chunk on the same line
📚 Detailed Documentation
InternVL3 Family
| Model Name | Vision Part | Language Part | Hugging Face Link |
| --- | --- | --- | --- |
| InternVL3-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B | 🤗 link |
| InternVL3-2B | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B | 🤗 link |
| InternVL3-8B | InternViT-300M-448px-V2_5 | Qwen2.5-7B | 🤗 link |
| InternVL3-9B | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | 🤗 link |
| InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B | 🤗 link |
| InternVL3-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B | 🤗 link |
| InternVL3-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B | 🤗 link |
Model Architecture
InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors (InternVL 1.5 and 2.0), following the "ViT-MLP-LLM" paradigm. In the new version, a randomly initialized MLP projector is used to integrate the newly incrementally pre-trained InternViT with various pre-trained large language models (LLMs), including InternLM 3 and Qwen 2.5.
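A minimal sketch of the "ViT-MLP-LLM" data flow described above, assuming illustrative dimensions and module names (this is not the actual InternVL3 implementation):

import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
VIT_DIM, LLM_DIM, NUM_VISUAL_TOKENS = 1024, 896, 256

class VisionToLLMProjector(nn.Module):
    """Randomly initialized MLP that maps ViT features into the LLM embedding space."""
    def __init__(self, vit_dim, llm_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features):
        return self.mlp(vit_features)

# vit_features: [batch, num_visual_tokens, vit_dim] produced by the vision encoder (e.g. InternViT)
vit_features = torch.randn(1, NUM_VISUAL_TOKENS, VIT_DIM)
visual_embeds = VisionToLLMProjector(VIT_DIM, LLM_DIM)(vit_features)
# The projected visual embeddings replace the <image> placeholder tokens in the
# LLM input sequence, and the language model (e.g. Qwen2.5) then decodes as usual.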
Training Strategy
Native Multimodal Pre-Training
A native multimodal pre-training approach is proposed that consolidates language and vision learning into a single pre-training stage. Unlike the standard paradigm of first training a language-only model and then adapting it to other modalities, this approach interleaves multimodal data (such as image-text, video-text, or interleaved image-text sequences) with large-scale text corpora. This unified training scheme allows the model to learn linguistic and multimodal representations simultaneously, ultimately enhancing its ability to handle vision-language tasks without separate alignment or bridging modules.
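The key point is that multimodal samples and pure-text samples are drawn from one mixed stream within a single pre-training stage. A toy sketch of such a mixed sampler (the corpora and mixing ratio below are assumptions for illustration only):

import random

# Hypothetical corpora; in practice these are large-scale datasets.
text_corpus = ["a pure-text document ...", "another text sample ..."]
multimodal_corpus = [
    {"images": ["img_0.jpg"], "text": "an image-text pair ..."},
    {"images": ["clip_0.mp4"], "text": "a video-text sample ..."},
]

def sample_pretraining_batch(batch_size, multimodal_ratio=0.5):
    """Draw a mixed batch so linguistic and multimodal representations are learned jointly."""
    batch = []
    for _ in range(batch_size):
        if random.random() < multimodal_ratio:
            batch.append(random.choice(multimodal_corpus))
        else:
            batch.append({"images": [], "text": random.choice(text_corpus)})
    return batch

# Every batch is trained with the same next-token objective, so no separate
# alignment or bridging stage is required.
print(sample_pretraining_batch(4))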
Supervised Fine-Tuning
In this phase, the random JPEG compression, square loss re-weighting, and multimodal data packing techniques proposed in InternVL2.5 are adopted. Compared with InternVL2.5, the main advance of InternVL3's supervised fine-tuning stage lies in the use of higher-quality and more diverse training data.
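As an illustration of one of these techniques, below is a minimal sketch of random JPEG compression as a training-time augmentation (the probability and quality range are assumptions, not the values used for InternVL3):

import io
import random
from PIL import Image

def random_jpeg_compression(img, prob=0.5, quality_range=(30, 95)):
    """Randomly re-encode the image as JPEG to simulate real-world compression artifacts."""
    if random.random() > prob:
        return img
    quality = random.randint(*quality_range)
    buffer = io.BytesIO()
    img.convert('RGB').save(buffer, format='JPEG', quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert('RGB')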
Mixed Preference Optimization
During pre-training and supervised fine-tuning, the model predicts the next token conditioned on previous ground-truth tokens. During inference, however, it predicts each token conditioned on its own previous outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift that can impair the model's chain-of-thought (CoT) reasoning. To mitigate this, MPO is adopted: it introduces additional supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance.
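MPO combines a preference loss over chosen/rejected response pairs with additional quality and generation terms. A minimal sketch of the DPO-style preference term under that description (the β value and dummy log-probabilities are illustrative assumptions):

import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy to rank the positive (chosen) response above the negative
    (rejected) one, relative to a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Dummy sequence-level log-probabilities (one scalar per response).
print(preference_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                      torch.tensor([-6.0]), torch.tensor([-8.5])))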
Test-Time Scaling
Test-time scaling has been shown to be an effective way to enhance the reasoning abilities of LLMs and MLLMs. In this work, a Best-of-N evaluation strategy is used, with VisualPRM-8B serving as the critic model to select the best response for reasoning and mathematics evaluation.
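A minimal sketch of Best-of-N selection using the chat API shown earlier in this card (score_fn stands in for the VisualPRM-8B critic; its interface and the sampling parameters are assumptions):

def best_of_n(question, pixel_values, model, tokenizer, score_fn, n=8):
    """Sample N candidate responses and keep the one the critic model scores highest."""
    gen_cfg = dict(max_new_tokens=1024, do_sample=True, temperature=0.7)
    candidates = [model.chat(tokenizer, pixel_values, question, gen_cfg) for _ in range(n)]
    scores = [score_fn(question, candidate) for candidate in candidates]
    return candidates[scores.index(max(scores))]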
Evaluation
Multimodal Capability Evaluation
The evaluation covers multimodal reasoning and mathematics, OCR, chart and document understanding, multi-image and real-world understanding, comprehensive multimodal and hallucination evaluation, visual grounding, multimodal multilingual understanding, video understanding, GUI grounding, and spatial reasoning.
Language Capability Evaluation
InternVL3 is compared with the Qwen2.5 chat models; thanks to native multimodal pre-training, the InternVL3 series outperforms the Qwen2.5 series in overall text performance.
Ablation Studies
Native Multimodal Pre-Training
Experiments are conducted on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Replacing the conventional MLP warm-up phase with the native multimodal pre-training process isolates its contribution to the model's overall multimodal capability. The results show that the model with native multimodal pre-training performs on par with the fully multi-stage-trained InternVL2-8B baseline on most benchmarks.
Mixed Preference Optimization
As shown in the table, models fine-tuned with MPO outperform their counterparts without MPO on seven multimodal reasoning benchmarks.
Variable Visual Position Encoding
Introducing V2PE leads to significant performance gains on most evaluation metrics. Moreover, the ablation studies show that even for tasks primarily involving conventional contexts, relatively small position increment values achieve optimal performance.
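A minimal sketch of the V2PE idea described above: text tokens advance the position index by the usual step of 1, while visual tokens advance it by a smaller fractional increment (the value 1/16 below is an illustrative assumption, not the increment used in InternVL3):

def v2pe_position_ids(token_types, visual_delta=1.0 / 16):
    """Assign position ids where visual tokens advance the position by a small
    fractional step, so long visual sequences consume less of the context window."""
    positions, pos = [], 0.0
    for t in token_types:  # each entry is 'text' or 'image'
        positions.append(pos)
        pos += 1.0 if t == 'text' else visual_delta
    return positions

print(v2pe_position_ids(['text', 'image', 'image', 'image', 'text']))
# [0.0, 1.0, 1.0625, 1.125, 1.1875]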
🔧 Technical Details
Model Architecture
- ViT-MLP-LLM paradigm: InternVL3 follows the "ViT-MLP-LLM" paradigm, combining visual feature extraction (ViT), feature projection (MLP), and language generation (LLM).
- Pixel unshuffle operation: a pixel unshuffle operation is applied to reduce the number of visual tokens to a quarter of the original (a sketch follows this list).
- Dynamic resolution strategy: a dynamic resolution strategy similar to InternVL 1.5 is adopted, splitting images into 448×448-pixel tiles.
- Multi-image and video support: support for multi-image and video data was added starting from InternVL 2.0.
- Variable Visual Position Encoding (V2PE): V2PE is integrated, using smaller and more flexible position increments for visual tokens to improve long-context understanding.
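As mentioned in the pixel unshuffle item above, the operation merges each 2×2 block of spatial visual features into one token, cutting the token count to a quarter. A minimal sketch (the shapes are illustrative, e.g. 1024 patch features from a 448×448 tile becoming 256 visual tokens):

import torch

def pixel_shuffle_tokens(x, scale=2):
    """Merge each scale x scale block of spatial features into a single token,
    reducing the number of visual tokens by a factor of scale**2."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # group each 2x2 block together
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)

feats = torch.randn(1, 32, 32, 1024)      # 32x32 = 1024 patch features per tile
print(pixel_shuffle_tokens(feats).shape)  # torch.Size([1, 256, 4096]) -> 256 tokens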
Training Strategy
- Native multimodal pre-training: consolidates language and vision learning into a single pre-training stage, enhancing the model's ability to handle vision-language tasks.
- Supervised fine-tuning: adopts random JPEG compression, square loss re-weighting, and multimodal data packing, with higher-quality and more diverse training data.
- Mixed Preference Optimization (MPO): introduces additional supervision to align the model's response distribution with the ground-truth distribution and improve reasoning performance.
- Test-time scaling: uses a Best-of-N evaluation strategy with VisualPRM-8B as the critic model to select the best response.
🚀 Quick Start
Loading the Model
16-bit (bf16 / fp16)
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
BNB 8-bit Quantization
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()
Inference Example
# Please refer to the Usage Examples section above
Fine-Tuning
Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.
Deployment
Deploying with lmdeploy
A Simple Example
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL3-1B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)
Multi-Image Inference
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN
model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]
# Numbering the images helps with multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
Batch Prompt Inference
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
image_urls = [
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
Multi-Turn Conversation
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
Serving
lmdeploy serve api_server OpenGVLab/InternVL3-1B --chat-template internvl2_5 --server-port 23333 --tp 1
Using the OpenAI-Style Interface
from openai import OpenAI
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
📄 License
This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under Apache-2.0.
Citation
If you find this project useful in your research, please consider citing:
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}








