InternVL3-1Bマルチモーダル大規模言語モデル - オープンソースで無料、原生マルチモーダル事前学習が可能

ホーム

Internvl3 1B Pretrained

OpenGVLabによって開発

InternVL3-1BはOpenGVLabが開発した先進的なマルチモーダル大規模言語モデルで、ネイティブマルチモーダル事前学習を完了していますが、事後学習は行われていません。

テキスト生成画像

Transformers

その他オープンソースライセンス:その他 #ネイティブマルチモーダル事前学習 #動的解像度処理 #長文脈理解

ダウンロード数 18

リリース時間 : 4/17/2025

モデル概要

InternVL3-1BはInternViTとQwen2.5アーキテクチャに基づくマルチモーダル大規模言語モデルで、画像とテキストの統合的理解と生成タスクをサポートします。

モデル特徴

ネイティブマルチモーダル事前学習

統一されたトレーニングスキームで言語とマルチモーダル表現を同時に学習し、視覚言語タスク処理能力を強化

可変視覚位置エンコーディング(V2PE)

柔軟な位置増分処理で視覚トークンを扱い、長文脈理解能力を向上

動的解像度処理

448×448ピクセルのタイル分割をサポートし、異なるサイズの入力に対応

モデル能力

画像理解

テキスト生成

マルチモーダル推論

多言語サポート

複数画像処理

動画理解

使用事例

視覚的質問応答

画像キャプション生成

入力画像に基づいて詳細な説明を生成

マルチモーダル対話

画像ベースの対話システム

画像に基づくマルチターン対話インタラクションをサポート

🚀 InternVL3-1B-Pretrained

このプロジェクトは、高度なマルチモーダル大規模言語モデル（MLLM）であるInternVL3-1Bの事前学習バージョンを提供します。多様なマルチモーダルタスクで優れた性能を発揮します。

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

🚀 クイックスタート

transformersを使用してInternVL3-1Bを実行するサンプルコードを提供します。

⚠️ 重要提示

transformers>=4.37.2を使用して、モデルが正常に動作するようにしてください。

モデルの読み込み

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

複数のGPUを使用する場合

import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-1B"
device_map = split_model('InternVL3-1B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()

Transformersを使用した推論

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
path = 'OpenGVLab/InternVL3-1B'
device_map = split_model('InternVL3-1B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (純文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话，拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话，独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

ストリーミング出力

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

✨ 主な機能

これはInternVL3-1Bの事前学習バージョンで、ネイティブなマルチモーダル事前学習を行っていますが、事後学習（SFTやMPO）は行っていません。InternVL3は、InternVL 2.5に比べて優れたマルチモーダル知覚と推論能力を示し、マルチモーダル機能をツール使用、GUIエージェント、産業画像分析、3Dビジョン知覚などに拡張しています。また、InternVL3はQwen2.5 Chatモデルと比較して、ネイティブなマルチモーダル事前学習の恩恵を受けて、全体的なテキスト性能が向上しています。

📚 ドキュメント

モデルアーキテクチャ

InternVL3は、InternVL 2.5と同じモデルアーキテクチャを保持しており、"ViT-MLP-LLM"パラダイムに従っています。この新しいバージョンでは、新しく増分事前学習されたInternViTと、InternLM 3やQwen 2.5などのさまざまな事前学習済みLLMを、ランダムに初期化されたMLPプロジェクターを使用して統合しています。

学習戦略

ネイティブマルチモーダル事前学習

言語とビジョンの学習を1つの事前学習段階に統合するネイティブマルチモーダル事前学習アプローチを提案しています。標準的なパラダイムでは、最初に言語のみのモデルを学習させ、その後に追加のモダリティを扱うように適応させますが、当社の方法では、マルチモーダルデータ（画像テキスト、ビデオテキスト、または画像テキストの交互シーケンスなど）を大規模なテキストコーパスと交互に配置します。この統一された学習スキームにより、モデルは言語表現とマルチモーダル表現を同時に学習でき、個別のアライメントまたはブリッジモジュールを必要とせずにビジョン言語タスクを処理する能力が向上します。

教師付き微調整

このフェーズでは、InternVL2.5で提案されたランダムJPEG圧縮、二乗損失の再重み付け、およびマルチモーダルデータパッキングの手法もInternVL3シリーズで採用されています。InternVL3のSFTフェーズの主な進歩は、より高品質で多様な学習データの使用です。具体的には、ツール使用、3Dシーン理解、GUI操作、長文脈タスク、ビデオ理解、科学図、創造的な文章作成、およびマルチモーダル推論の学習サンプルをさらに拡張しています。

混合嗜好最適化

事前学習とSFTの間、モデルは以前の真値トークンを条件として次のトークンを予測するように学習されます。ただし、推論中は、モデルは自身の事前出力に基づいて各トークンを予測します。この真値トークンとモデル予測トークンの不一致により、分布シフトが発生し、モデルの思考連鎖（CoT）推論能力が損なわれる可能性があります。この問題を軽減するために、MPOを採用しています。MPOは、正例と負例の両方から追加の監督を導入して、モデルの応答分布を真値分布に合わせ、推論性能を向上させます。

テスト時スケーリング

テスト時スケーリングは、LLMやMLLMの推論能力を向上させる効果的な方法であることが示されています。この研究では、Best-of-N評価戦略を使用し、VisualPRM-8Bを評価モデルとして使用して、推論と数学評価の最良の応答を選択しています。

マルチモーダル能力の評価

InternVL3のマルチモーダル能力を、マルチモーダル推論と数学、OCR、チャートとドキュメント理解、マルチ画像と現実世界の理解、包括的なマルチモーダルと幻覚評価、ビジュアルグラウンディング、マルチモーダル多言語理解、ビデオ理解、GUIグラウンディング、空間推論などのさまざまなタスクで評価しています。

言語能力の評価

InternVL3は、Qwen2.5 Chatモデルと比較して、ネイティブなマルチモーダル事前学習の恩恵を受けて、全体的なテキスト性能が向上しています。

アブレーション研究

ネイティブマルチモーダル事前学習

InternVL2-8Bモデルに対して、アーキテクチャ、初期化パラメータ、学習データを完全に変更せずに実験を行いました。従来、InternVL2-8Bは、特徴アライメントのためのMLPウォームアップフェーズから始まり、その後に命令微調整段階が続く学習パイプラインを採用しています。実験では、従来のMLPウォームアップフェーズをネイティブマルチモーダル事前学習プロセスに置き換えました。この変更により、ネイティブマルチモーダル事前学習がモデルの全体的なマルチモーダル能力に与える影響を分離することができました。

混合嗜好最適化

MPOで微調整されたモデルは、MPOを使用しないモデルと比較して、7つのマルチモーダル推論ベンチマークで優れた推論性能を示しています。具体的には、InternVL3-78BとInternVL3-38Bは、それぞれ4.1ポイントと4.5ポイント上回っています。

可変ビジュアル位置符号化

V2PEの導入により、ほとんどの評価指標で大幅な性能向上が見られました。また、位置増分 \( \delta \) を変化させることによるアブレーション研究では、従来のコンテキストを主に扱うタスクでも、比較的小さい \( \delta \) 値で最適な性能を達成できることが明らかになりました。

🔧 技術詳細

モデルの構成

モデル名	ビジョン部分	言語部分	HFリンク
InternVL3-1B	InternViT-300M-448px-V2_5	Qwen2.5-0.5B	🤗 link
InternVL3-2B	InternViT-300M-448px-V2_5	Qwen2.5-1.5B	🤗 link
InternVL3-8B	InternViT-300M-448px-V2_5	Qwen2.5-7B	🤗 link
InternVL3-9B	InternViT-300M-448px-V2_5	internlm3-8b-instruct	🤗 link
InternVL3-14B	InternViT-300M-448px-V2_5	Qwen2.5-14B	🤗 link
InternVL3-38B	InternViT-6B-448px-V2_5	Qwen2.5-32B	🤗 link
InternVL3-78B	InternViT-6B-448px-V2_5	Qwen2.5-72B	🤗 link

ファインチューニング

多くのリポジトリがInternVLシリーズモデルのファインチューニングをサポートしており、InternVL、SWIFT、XTurnerなどがあります。ファインチューニングの詳細については、それらのドキュメントを参照してください。

デプロイメント

LMDeploy

LMDeployは、LLMとVLMの圧縮、デプロイ、サービングを行うツールキットです。

# if lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
pip install lmdeploy>=0.7.3

LMDeployは、マルチモーダルビジョン言語モデル（VLM）の複雑な推論プロセスを、大規模言語モデル（LLM）の推論パイプラインと同様に、使いやすいパイプラインに抽象化しています。

'Hello, world' の例

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)

複数画像の推論

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

バッチプロンプトの推論

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

マルチターン会話

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

サービス

LMDeployのapi_serverを使用すると、モデルを簡単にサービスにパックすることができます。提供されるRESTful APIは、OpenAIのインターフェースと互換性があります。

lmdeploy serve api_server OpenGVLab/InternVL3-1B --chat-template internvl2_5 --server-port 23333 --tp 1

OpenAIスタイルのインターフェースを使用するには、OpenAIをインストールする必要があります。

pip install openai

その後、以下のコードを使用してAPIを呼び出すことができます。

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

📄 ライセンス

このプロジェクトはMITライセンスの下でリリースされています。このプロジェクトは、事前学習済みのQwen2.5をコンポーネントとして使用しており、これはApache-2.0ライセンスの下でライセンスされています。

引用

このプロジェクトがあなたの研究に役立った場合は、以下のように引用してください。

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}