InternVL3-1B: Open-Source Multimodal Large Language Model, Free to Deploy, with Strong Perception and Reasoning

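Developed by FriendliAI

InternVL3-1B is a 1-billion-parameter multimodal large language model in the InternVL3 series. It combines an InternViT vision encoder with a Qwen2.5 language model and offers strong multimodal perception and reasoning.

Transformers

Open-source license: Other #Multimodal large language model #Native multimodal pre-training #Long-context understanding

Downloads: 71

Released: 4/12/2025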

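Model Overview

InternVL3-1B is an advanced multimodal large language model that combines vision and language processing. It accepts image, video, and text inputs and is suited to complex multimodal understanding and generation tasks.

Model Features

Native multimodal pre-training

Merges language and vision learning into a single pre-training stage, strengthening multimodal task handling.

Variable Visual Position Encoding (V2PE)

Uses smaller, more flexible position increments for visual tokens, improving long-context understanding.

Mixed Preference Optimization (MPO)

Uses supervision from positive and negative samples to adjust the model's response distribution, improving reasoning performance.

Dynamic resolution strategy

Splits images into 448×448-pixel tiles and supports multi-image and video data.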

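Model Capabilities

Multimodal reasoning

Image understanding

Video understanding

Text generation

OCR

Chart understanding

Document understanding

GUI grounding

Spatial reasoning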

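Use Cases

Industrial image analysis

Industrial defect detection

Identifies defects in industrial products through image analysis.

Improves production efficiency through accurate defect identification.

3D vision perception

3D scene understanding

Analyzes objects and spatial relationships in 3D scenes.

Supports accurate understanding of complex 3D scenes.

Tool use

Automated tool operation

Operates tools from natural-language instructions.

Makes tool use more convenient and efficient.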

🚀 InternVL3-1B

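InternVL3 is an advanced multimodal large language model (MLLM) series with strong overall performance. Compared to InternVL 2.5, InternVL3 shows better multimodal perception and reasoning, and extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and more.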

[📂 GitHub] [📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

🚀 Quick Start

Model Loading

16-bit (bf16 / fp16)

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

BNB 8-bit Quantization

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL3-1B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

Multiple GPUs

import math
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

def split_model(model_path):
    # Build a device map that spreads the LLM layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Keep the last LLM layer on GPU 0, together with the embeddings and head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL3-1B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
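
The layer-allocation arithmetic inside split_model can be checked without any GPU. The sketch below reproduces only that arithmetic in a hypothetical helper, layers_per_gpu (not part of the model's code). The layer count of 24 is an illustrative assumption; in practice the value comes from config.llm_config.num_hidden_layers.

```python
import math

def layers_per_gpu(num_layers: int, world_size: int) -> list[int]:
    """Mirror the per-GPU layer counts computed in split_model (no CUDA needed)."""
    # The first GPU also hosts the ViT, so it is counted as half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(counts[0] * 0.5)
    return counts

# 24 layers is an assumed, illustrative value.
for gpus in (1, 2, 4):
    counts = layers_per_gpu(24, gpus)
    print(gpus, counts, sum(counts))
```

With two GPUs the counts add up to the assumed layer total. With four GPUs they add up to 25, one more than the 24 layers assumed here, so the assignment loop in split_model would also write an entry for a layer index that does not exist. It may be worth running this check for the GPU count you plan to use.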

Inference with Transformers

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    # Build a device map that spreads the LLM layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    # Keep the last LLM layer on GPU 0, together with the embeddings and head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

# Note: at roughly 1B parameters, the bfloat16 weights take about 2 GB,
# so a single GPU is normally enough; splitting across GPUs is optional.
path = 'OpenGVLab/InternVL3-1B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
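
The frame-sampling arithmetic used by get_index above can be checked without a video file. The sketch below mirrors that formula in pure Python under a hypothetical name, sample_frame_indices; it uses the built-in round where the original uses np.round (both round halves to even). The clip length and frame rate are made-up example values.

```python
def sample_frame_indices(bound, fps, max_frame, first_idx=0, num_segments=32):
    """Pure-Python mirror of the get_index arithmetic shown above."""
    if bound:
        start, end = bound
    else:
        start, end = -100000, 100000  # effectively "the whole video"
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = (end_idx - start_idx) / num_segments
    # One frame per segment, taken near the middle of that segment.
    return [
        int(start_idx + seg_size / 2 + round(seg_size * idx))
        for idx in range(num_segments)
    ]

# Hypothetical clip: 241 frames at 30 fps, so max_frame = 240.
whole_clip = sample_frame_indices(None, fps=30.0, max_frame=240, num_segments=8)
print(whole_clip)  # [15, 45, 75, 105, 135, 165, 195, 225]

# Restrict sampling to the window between 2 s and 6 s.
window = sample_frame_indices((2, 6), fps=30.0, max_frame=240, num_segments=8)
print(window)  # [67, 82, 97, 112, 127, 142, 157, 172]
```

The indices are spread evenly, so a larger num_segments gives denser coverage at the cost of more visual tokens per video.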

Streaming Output

from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line

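✨ Key Features

Advanced multimodal performance: compared to InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning.
Extended multimodal capabilities: supports new capabilities such as tool use, GUI agents, industrial image analysis, and 3D vision perception.
Long-context understanding: adopting Variable Visual Position Encoding (V2PE) improves long-context understanding.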

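📚 Documentation

Model Architecture

InternVL3 retains the same model architecture as InternVL 2.5 and follows the "ViT-MLP-LLM" paradigm. This version integrates a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector.

[Figure: model architecture (image not included)]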

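As in the previous version, a pixel unshuffle operation reduces the number of visual tokens to one-quarter of the original. The model also adopts a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is added support for multi-image and video data.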
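
As a rough worked example of the token arithmetic in the paragraph above, the sketch below counts visual tokens per tile and per image. It assumes a ViT patch size of 14 pixels (an assumption; the text does not state it) and a pixel-unshuffle factor of 2 per side, which is what a one-quarter reduction implies. The helper names are hypothetical.

```python
def visual_tokens_per_tile(tile_size: int = 448, patch_size: int = 14,
                           unshuffle_factor: int = 2) -> int:
    """Tokens for one tile after pixel unshuffle (patch_size is an assumption)."""
    patches_per_side = tile_size // patch_size
    patches = patches_per_side ** 2
    # Unshuffling by a factor of 2 per side merges 2x2 patches into one token.
    return patches // (unshuffle_factor ** 2)

def visual_tokens_for_image(num_tiles: int, use_thumbnail: bool = True) -> int:
    """Total tokens for an image split into num_tiles tiles.

    A thumbnail tile is added only when the image has more than one tile,
    matching the rule in dynamic_preprocess.
    """
    extra = 1 if use_thumbnail and num_tiles != 1 else 0
    return (num_tiles + extra) * visual_tokens_per_tile()

print(visual_tokens_per_tile())     # 256
print(visual_tokens_for_image(12))  # 3328
print(visual_tokens_for_image(1))   # 256
```

Under these assumptions, the default max_num=12 setting can produce a few thousand visual tokens for a single image, which is one reason to lower max_num for multi-image or video inputs.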

Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 shows better long-context understanding than its predecessors.
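
To make the idea of a smaller position increment concrete, here is a minimal sketch of one plausible reading: text tokens advance the position index by 1, while visual tokens advance it by a fraction delta. The function name, the token labels, and the value 0.25 are illustrative assumptions, not the model's actual implementation.

```python
def v2pe_position_ids(token_kinds, delta=0.25):
    """Assign position indices where visual tokens use a smaller increment.

    token_kinds: sequence of "text" or "image" labels, one per token.
    delta: increment applied when the current token is visual (assumed value).
    """
    positions = []
    current = 0.0
    for i, kind in enumerate(token_kinds):
        if i > 0:
            current += delta if kind == "image" else 1.0
        positions.append(current)
    return positions

tokens = ["text", "image", "image", "image", "image", "text"]
print(v2pe_position_ids(tokens, delta=0.25))  # [0.0, 0.25, 0.5, 0.75, 1.0, 2.0]
print(v2pe_position_ids(tokens, delta=1.0))   # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

With delta below 1, a long run of visual tokens consumes far less of the position range than the same number of text tokens, which is the intuition behind the long-context benefit described above.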

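Training Strategy

Native Multimodal Pre-Training

We propose a native multimodal pre-training approach that consolidates language and vision learning into a single pre-training stage. In the standard paradigm, a language-only model is trained first and then adapted to handle additional modalities. Our method instead interleaves multimodal data (such as image-text, video-text, or image-text interleaved sequences) with large-scale text corpora. This unified training scheme lets the model learn linguistic and multimodal representations simultaneously, improving its ability to handle vision-language tasks without separate alignment or bridging modules. See our paper for details.

Supervised Fine-Tuning

In this stage, the techniques proposed in InternVL2.5 (random JPEG compression, square loss re-weighting, and multimodal data packing) are also used in the InternVL3 series. The main advance of the SFT stage in InternVL3 over InternVL2.5 is the use of higher-quality and more diverse training data. Specifically, we further extend the training samples for tool use, 3D scene understanding, GUI operations, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.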

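Mixed Preference Optimization

During pre-training and SFT, the model is trained to predict the next token conditioned on the previous ground-truth tokens. During inference, however, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning. To mitigate this, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance. Specifically, the training objective of MPO is a combination of a preference loss $\mathcal{L}_{\text{p}}$, a quality loss $\mathcal{L}_{\text{q}}$, and a generation loss $\mathcal{L}_{\text{g}}$, formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where $w_{*}$ denotes the weight assigned to each loss component. See our paper for more details about MPO.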
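
The weighted sum above is simple to state in code. The sketch below only combines three already-computed scalar losses; how each component is computed is not covered here, and the function name and the numbers in the example are hypothetical placeholders rather than values from the paper.

```python
def mpo_total_loss(loss_p: float, loss_q: float, loss_g: float,
                   w_p: float, w_q: float, w_g: float) -> float:
    """Weighted combination of preference, quality, and generation losses."""
    return w_p * loss_p + w_q * loss_q + w_g * loss_g

# Placeholder component losses and weights, chosen only for illustration.
total = mpo_total_loss(loss_p=1.0, loss_q=2.0, loss_g=3.0,
                       w_p=0.5, w_q=0.25, w_g=1.0)
print(total)  # 4.0
```

Setting one weight to zero removes that component from the objective, which is a convenient way to run ablations over the three terms.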

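Test-Time Scaling

Test-time scaling has been shown to be an effective way to improve the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy with VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.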
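
Best-of-N selection can be sketched in a few lines: generate N candidate responses, score each with a critic, and keep the highest-scoring one. In the sketch below the critic is a stand-in callable; the toy scorer is purely illustrative and is not how VisualPRM-8B scores responses.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest critic score."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=score)

def toy_score(response: str) -> float:
    # Hypothetical critic: rewards responses that state a final answer.
    return 1.0 if "answer:" in response.lower() else 0.0

candidates = [
    "I think it might be 12.",
    "Step 1: 3 * 4 = 12. Answer: 12",
    "Not sure.",
]
print(best_of_n(candidates, toy_score))  # Step 1: 3 * 4 = 12. Answer: 12
```

In a real setup the candidates would come from N sampled generations of the same prompt, and the score callable would wrap the critic model.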

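Evaluation on Multimodal Capability

Multimodal reasoning and mathematics
OCR, chart, and document understanding
Multi-image and real-world comprehension
Comprehensive multimodal and hallucination evaluation
Visual grounding
Multimodal multilingual understanding
Video understanding
GUI grounding
Spatial reasoning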

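Evaluation on Language Capability

We compare InternVL3 with the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series. Note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, because we ran the OpenCompass evaluation using the prompt versions given in the table across all datasets.

[Table: language benchmark comparison (image not included)]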

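Ablation Study

Native Multimodal Pre-Training

We conducted experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B uses a training pipeline that begins with an MLP warmup phase for feature alignment, followed by an instruction tuning stage. In our experiments, we replaced the conventional MLP warmup phase with a native multimodal pre-training process. This change isolates the contribution of native multimodal pre-training to the model's overall multimodal capability.

The evaluation results in the figure below show that the model with native multimodal pre-training performs comparably to the fully multi-stage-trained InternVL2-8B baseline on most benchmarks. Furthermore, when followed by instruction tuning on higher-quality data, the model shows further gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in giving MLLMs strong multimodal capabilities.

[Figure: ablation results for native multimodal pre-training (image not included)]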

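Mixed Preference Optimization

As shown in the table below, models fine-tuned with MPO show better reasoning performance across seven multimodal reasoning benchmarks than their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance gains stem primarily from the training algorithm.

[Table: MPO ablation results (image not included)]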

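Variable Visual Position Encoding

As reported in the table below, introducing V2PE leads to significant performance gains across most evaluation metrics. In addition, our ablation studies, which vary the positional increment $\delta$, reveal that even for tasks primarily involving conventional contexts, relatively small $\delta$ values can achieve optimal performance. These findings provide useful insights for refining position encoding strategies for visual tokens in MLLMs.

[Table: V2PE ablation results (image not included)]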

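🔧 Technical Details

Model Configuration

Attribute	Details
Model type	Image-text-to-text
Training data	OpenGVLab/MMPR-v1.2
Base models	OpenGVLab/InternViT-300M-448px-V2_5, Qwen/Qwen2.5-0.5B
Base model relation	Merge
Language support	Multilingual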

注意事项

⚠️ 重要提示

transformers のバージョンは 4.37.2 以上を使用してください。そうしないと、モデルが正常に動作しない可能性があります。
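
A quick way to check the installed version before loading the model is sketched below. The parsing helper is a simplified, hypothetical stand-in that compares only the leading numeric parts of a version string; it is not a full version-specifier parser.

```python
from importlib import metadata

def parse_version(text: str) -> tuple:
    """Extract the leading numeric components, e.g. '4.40.0.dev0' -> (4, 40, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed: str, minimum: str = "4.37.2") -> bool:
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)

try:
    installed = metadata.version("transformers")
    status = "OK" if meets_minimum(installed) else "too old, please upgrade"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```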

💡 Usage Tip

When using multiple GPUs, make sure the first and last layers of the large language model (LLM) are on the same device to avoid errors.
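
A device map can be checked against this tip before it is passed to from_pretrained. The helper below is a hypothetical sketch; it assumes the layer keys follow the 'language_model.model.layers.N' naming used in the split_model example in this document.

```python
def first_and_last_layer_colocated(device_map: dict, num_layers: int,
                                   prefix: str = "language_model.model.layers.") -> bool:
    """Return True if the first and last LLM layers map to the same device."""
    first = device_map.get(f"{prefix}0")
    last = device_map.get(f"{prefix}{num_layers - 1}")
    # A missing entry counts as a failure; device index 0 is a valid device.
    return first is not None and first == last

prefix = "language_model.model.layers."
good_map = {f"{prefix}0": 0, f"{prefix}1": 1, f"{prefix}2": 0}
bad_map = {f"{prefix}0": 0, f"{prefix}1": 1, f"{prefix}2": 1}

print(first_and_last_layer_colocated(good_map, num_layers=3))  # True
print(first_and_last_layer_colocated(bad_map, num_layers=3))   # False
```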

📦 Installation

Installing LMDeploy

# if lmdeploy<0.7.3, you need to explicitly set chat_template_config=ChatTemplateConfig(model_name='internvl2_5')
# Quote the requirement so the shell does not treat ">=" as an output redirection.
pip install "lmdeploy>=0.7.3"

Installing OpenAI

pip install openai

💻 Usage Examples

Basic Usage

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))
response = pipe(('describe this image', image))
print(response.text)

Advanced Usage

Multi-Image Inference

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)

Batch Prompt Inference

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)

Multi-Turn Conversation

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, ChatTemplateConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL3-1B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1), chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

Service

lmdeploy serve api_server OpenGVLab/InternVL3-1B --chat-template internvl2_5 --server-port 23333 --tp 1

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'describe this image',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)

📄 License

This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
