InternVL3-2Bオープンソース多モーダル大規模モデル - 視覚言語理解、推論、多タスク処理を無料でサポート

ホーム

Internvl3 2B Pretrained

OpenGVLabによって開発

InternVL3-2BはOpenGVLabが開発した先進的なマルチモーダル大規模言語モデルで、強力な視覚言語理解と推論能力を備え、様々なマルチモーダルタスクをサポートします。

テキスト生成画像

Transformers

その他オープンソースライセンス:Apache-2.0 #ネイティブマルチモーダル事前学習 #可変視覚位置エンコーディング #複数画像・動画理解

ダウンロード数 61

リリース時間 : 4/17/2025

モデル概要

InternVL3-2BはQwen2.5-1.5BとInternViT-300M-448px-V2_5を統合したマルチモーダル大規模言語モデルで、ネイティブマルチモーダル事前学習を完了し、優れた総合性能を発揮します。

モデル特徴

ネイティブマルチモーダル事前学習

言語と視覚学習を単一の事前学習段階に統合し、マルチモーダル表現能力を強化

可変視覚位置エンコーディング(V2PE)

より小さく柔軟な位置増分を使用し、長文脈理解能力を向上

混合選好最適化(MPO)

正負サンプルの監視によりモデル応答分布を調整し、推論性能を向上

動的解像度処理

448×448ピクセルのタイル分割をサポートし、様々なサイズの入力に対応

モデル能力

マルチモーダル推論

画像説明生成

文書理解

複数画像分析

動画理解

GUI位置特定

空間推論

多言語理解

使用事例

視覚コンテンツ分析

画像説明生成

入力画像の詳細な説明を生成

高品質な自然言語説明

複数画像比較

複数画像の類似点と相違点を分析

正確な比較分析結果

産業応用

産業画像分析

産業シーンにおける画像データを分析

正確な欠陥検出と分類

インタラクティブアプリケーション

GUIエージェント

グラフィカルユーザーインターフェースを理解し操作

正確なインターフェース要素認識と操作

🚀 InternVL3-2B-Pretrained

これはInternVL3-2Bの事前学習バージョンで、ネイティブなマルチモーダル事前学習を行っていますが、事後学習（SFTやMPO）は行っていません。どのバージョンを使用するかわからない場合は、InternVL3-2Bバージョンを使用してください。

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

🚀 クイックスタート

モデルの読み込み

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-2B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-2B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

複数GPU

import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-2B"
device_map = split_model('InternVL3-2B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()

Transformersを使用した推論

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
path = 'OpenGVLab/InternVL3-2B'
device_map = split_model('InternVL3-2B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (純文本対話)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (単図単輪対話)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (単図多輪対話)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多図多輪対話，拼接画像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多図多輪対話，独立画像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (単図バッチ処理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (動画多輪対話)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

ストリーミング出力

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

✨ 主な機能

概要

これはInternVL3-2Bの事前学習バージョンで、ネイティブなマルチモーダル事前学習を行っていますが、事後学習（SFTやMPO）は行っていません。InternVL3は、優れた総合性能を示す高度なマルチモーダル大規模言語モデル（MLLM）シリーズです。InternVL 2.5と比較して、InternVL3は優れたマルチモーダル知覚と推論能力を示し、マルチモーダル機能をツール使用、GUIエージェント、産業画像分析、3Dビジョン知覚などに拡張しています。

モデルアーキテクチャ

InternVL3は、InternVL 2.5 とその前身であるInternVL 1.5および2.0と同じモデルアーキテクチャを保持しており、「ViT-MLP-LLM」パラダイムに従っています。この新しいバージョンでは、新しく増分事前学習されたInternViTを、InternLM 3やQwen 2.5などのさまざまな事前学習済みLLMと、ランダムに初期化されたMLPプロジェクターを使用して統合しています。

学習戦略

ネイティブマルチモーダル事前学習

言語とビジョンの学習を1つの事前学習段階に統合するネイティブマルチモーダル事前学習アプローチを提案しています。標準的なパラダイムでは、最初に言語のみのモデルを学習させ、その後に追加のモダリティを扱うように適応させますが、当社の方法では、マルチモーダルデータ（画像-テキスト、ビデオ-テキスト、または画像-テキストの交互シーケンスなど）を大規模なテキストコーパスとインターリーブします。この統一された学習スキームにより、モデルは言語表現とマルチモーダル表現の両方を同時に学習でき、別々のアライメントまたはブリッジングモジュールを必要とせずに、ビジョン-言語タスクを処理する能力が向上します。

教師付き微調整

このフェーズでは、InternVL2.5で提案されたランダムJPEG圧縮、二乗損失の再重み付け、およびマルチモーダルデータのパッキングの技術も、InternVL3シリーズで使用されています。InternVL3のSFTフェーズの主な進歩は、InternVL2.5と比較して、より高品質で多様な学習データの使用です。具体的には、ツール使用、3Dシーン理解、GUI操作、長文脈タスク、ビデオ理解、科学図、創造的な執筆、およびマルチモーダル推論の学習サンプルをさらに拡張しています。

混合嗜好最適化

事前学習とSFTの間、モデルは以前の真値トークンを条件として次のトークンを予測するように学習されます。ただし、推論中は、モデルは自身の事前出力に基づいて各トークンを予測します。この真値トークンとモデル予測トークンの不一致により、分布シフトが発生し、モデルの思考連鎖（CoT）推論能力が損なわれる可能性があります。この問題を軽減するために、MPOを採用しています。これは、正と負のサンプルの両方から追加の監督を導入して、モデル応答分布を真値分布に一致させ、推論性能を向上させます。

テスト時スケーリング

テスト時スケーリングは、LLMとMLLMの推論能力を向上させる効果的な方法であることが示されています。この研究では、Best-of-N評価戦略を使用し、VisualPRM-8Bを評価モデルとして使用して、推論と数学評価の最良の応答を選択しています。

評価

マルチモーダル能力評価

マルチモーダル推論と数学
OCR、チャート、およびドキュメント理解
多画像と現実世界の理解
包括的なマルチモーダルと幻覚評価
ビジュアルグラウンディング
マルチモーダル多言語理解
ビデオ理解
GUIグラウンディング
空間推論

言語能力評価

InternVL3をQwen2.5チャットモデルと比較しています。Qwen2.5の対応する事前学習済みベースモデルは、InternVL3の言語コンポーネントの初期化として使用されています。ネイティブマルチモーダル事前学習の恩恵を受けて、InternVL3シリーズはQwen2.5シリーズよりもさらに優れた総合テキスト性能を達成しています。

アブレーション研究

ネイティブマルチモーダル事前学習

InternVL2-8Bモデルに対して、そのアーキテクチャ、初期化パラメータ、および学習データを完全に変更せずに実験を行っています。従来、InternVL2-8Bは、特徴アライメントのためのMLPウォームアップフェーズから始まり、その後に命令微調整段階が続く学習パイプラインを採用しています。実験では、従来のMLPウォームアップフェーズをネイティブマルチモーダル事前学習プロセスに置き換えています。この変更により、ネイティブマルチモーダル事前学習がモデルの総合的なマルチモーダル能力に与える影響を分離しています。

混合嗜好最適化

下の表に示すように、MPOで微調整されたモデルは、MPOを使用しないモデルと比較して、7つのマルチモーダル推論ベンチマークで優れた推論性能を示しています。具体的には、InternVL3-78BとInternVL3-38Bは、それぞれ4.1ポイントと4.5ポイント上回っています。注目すべきは、MPOに使用される学習データはSFTに使用されるデータのサブセットであり、このことは性能向上が主に学習アルゴリズムに起因することを示しています。

可変ビジュアル位置符号化

下の表に報告されているように、V2PEの導入により、ほとんどの評価指標で大幅な性能向上が見られます。さらに、位置増分 \( \delta \) を変化させることによるアブレーション研究では、従来のコンテキストを主に含むタスクでも、比較的小さい \( \delta \) 値で最適な性能を達成できることが明らかになっています。

📦 インストール

LMDeploy

LMDeployは、LLMとVLMの圧縮、デプロイ、およびサービングのためのツールキットです。

# if lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
pip install lmdeploy>=0.7.3

💻 使用例

LMDeployの使用例

'Hello, world' の例

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-2B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)

多画像推論

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-2B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

バッチプロンプト推論

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-2B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

多ターン会話

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-2B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

サービス

LMDeployのapi_serverを使用すると、モデルを単一のコマンドで簡単にサービスにパックできます。提供されるRESTful APIは、OpenAIのインターフェースと互換性があります。

lmdeploy serve api_server OpenGVLab/InternVL3-2B --chat-template internvl2_5 --server-port 23333 --tp 1

OpenAIスタイルのインターフェースを使用するには、OpenAIをインストールする必要があります。

pip install openai

その後、以下のコードを使用してAPI呼び出しを行います。

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

📚 ドキュメント

詳細については、公式ドキュメントを参照してください。

🔧 技術詳細

モデルアーキテクチャ

image/png

学習戦略

ネイティブマルチモーダル事前学習

ネイティブマルチモーダル事前学習アプローチは、言語とビジョンの学習を1つの事前学習段階に統合します。

教師付き微調整

InternVL3のSFTフェーズでは、InternVL2.5で提案された技術を使用し、より高品質で多様な学習データを使用しています。

混合嗜好最適化

MPOを使用して、モデル応答分布を真値分布に一致させ、推論性能を向上させています。

テスト時スケーリング

Best-of-N評価戦略を使用し、VisualPRM-8Bを評価モデルとして使用しています。

評価

マルチモーダル能力評価

マルチモーダル推論と数学
OCR、チャート、およびドキュメント理解
多画像と現実世界の理解
包括的なマルチモーダルと幻覚評価
ビデオ理解
GUIグラウンディング
空間推論

言語能力評価

InternVL3は、ネイティブマルチモーダル事前学習の恩恵を受けて、Qwen2.5シリーズよりも優れた総合テキスト性能を達成しています。

アブレーション研究

ネイティブマルチモーダル事前学習

InternVL2-8Bモデルに対する実験では、ネイティブマルチモーダル事前学習がモデルの総合的なマルチモーダル能力に与える影響を分離しています。

混合嗜好最適化

MPOで微調整されたモデルは、MPOを使用しないモデルと比較して、7つのマルチモーダル推論ベンチマークで優れた推論性能を示しています。

可変ビジュアル位置符号化

V2PEの導入により、ほとんどの評価指標で大幅な性能向上が見られます。

📄 ライセンス

このプロジェクトはMITライセンスの下でリリースされています。このプロジェクトは、事前学習済みのQwen2.5をコンポーネントとして使用しており、これはApache-2.0ライセンスの下でライセンスされています。

引用

このプロジェクトがあなたの研究に役立った場合は、以下を引用してください。

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}