InternVL3-2B-hf: an open-source multimodal large model that processes images, video, and text for free, with efficient inference

InternVL3-2B-hf

Developed by OpenGVLab

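InternVL3-2B is a multimodal large language model implemented on the Hugging Face Transformers library. It performs well on multimodal tasks involving images, video, and text, accepts several input formats, and supports efficient batched inference.

Image-to-Text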

Transformers

Open Source License: Other #Multimodal large model #Image-to-text generation #Video understanding

Downloads 41.22k

Release Time : 4/18/2025

Model Overview

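InternVL3-2B is an advanced multimodal large language model that accepts interleaved image, video, and text input. It offers strong multimodal perception and reasoning and suits a wide range of vision-language tasks.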

Model Features

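Multimodal processing

Accepts interleaved image, video, and text input for multimodal understanding.

Batch inference support

Processes many image and text inputs in a single batch, improving inference throughput.

Native multimodal pre-training

Thanks to native multimodal pre-training, the model can even outperform pure language models on text tasks.

Extended application domains

Supports extended applications such as tool use, GUI agents, industrial image analysis, and 3D vision perception.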

Model Capabilities

Image captioning

Video content understanding

Multimodal dialogue

Cross-modal reasoning

Text generation

Multilingual processing

Batched multi-image processing

Use Cases

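Content understanding and generation

Image captioning

Generates a detailed description of an input image.

Can accurately describe the objects, scenes, and relationships in the image.

Video content analysis

Understands video content and answers questions about it.

Can identify actions and events in the video.

Creative applications

Image-inspired poetry

Writes a poem based on the content of an image.

Can generate a poem that matches the mood of the image.

Educational applications

Landmark identification and description

Identifies and describes well-known landmarks in an image.

Can accurately identify and describe several well-known landmarks.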

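🚀 InternVL3-2B 🤗 Transformers Implementation

InternVL3-2B is a multimodal large language model implemented with the Hugging Face 🤗 Transformers library. It performs well on multimodal tasks involving images, video, and text, and supports several input formats along with efficient batched inference.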

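🚀 Quick Start

This repository contains the Hugging Face 🤗 Transformers implementation of the OpenGVLab/InternVL3-2B model. It is functionally equivalent to the original OpenGVLab release. As a Transformers model, it supports the core library features, including different attention implementations (such as SDPA and FA2) and efficient batched inference with interleaved image, video, and text input.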
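As a rough sketch of how an attention backend might be chosen at load time: recent Transformers releases accept an attn_implementation keyword in from_pretrained, with values such as "eager", "sdpa", and "flash_attention_2". The helper names attention_kwargs and load_internvl are hypothetical, and the flash_attention_2 option additionally assumes that the flash-attn package and a supported GPU are available.

```python
# Hypothetical helpers; only from_pretrained and its attn_implementation
# keyword come from the Transformers library.
SUPPORTED_ATTENTION = ("eager", "sdpa", "flash_attention_2")


def attention_kwargs(attn_implementation="sdpa"):
    """Return keyword arguments that select an attention backend."""
    if attn_implementation not in SUPPORTED_ATTENTION:
        raise ValueError(f"unknown attention implementation: {attn_implementation!r}")
    return {"attn_implementation": attn_implementation}


def load_internvl(checkpoint="OpenGVLab/InternVL3-2B-hf", attn_implementation="sdpa"):
    """Load the model with the chosen backend (downloads weights, assumes a CUDA GPU)."""
    # Imported lazily so the helper above can be used without the heavy dependencies.
    import torch
    from transformers import AutoModelForImageTextToText

    return AutoModelForImageTextToText.from_pretrained(
        checkpoint,
        device_map="cuda",
        torch_dtype=torch.bfloat16,
        **attention_kwargs(attn_implementation),
    )
```

Calling load_internvl(attn_implementation="flash_attention_2") would request FA2; the default requests SDPA.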

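✨ Key Features

We introduce InternVL3, a series of advanced multimodal large language models (MLLMs) that demonstrates superior overall performance. Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and it extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, and 3D vision perception.

We also compare InternVL3 with the Qwen2.5 chat models. The language component of InternVL3 is initialized from the corresponding pre-trained Qwen2.5 models. Thanks to native multimodal pre-training, the InternVL3 series achieves better overall text performance than the Qwen2.5 series.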

More information about the InternVL3 series is available on the original checkpoint page, OpenGVLab/InternVL3-2B.

💻 Usage Examples

Basic usage

Inference with a pipeline

The following example uses the image-text-to-text pipeline to run inference with an InternVL3 model in a few lines of code.

>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-2B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n   - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'

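Single-image inference

This example shows how to run inference on a single image with an InternVL model using the chat template.

⚠️ Important

This model was trained with a specific chat prompt format. Use processor.apply_chat_template(my_conversation_dict) to format your prompts correctly.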
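The conversation passed to apply_chat_template is a list of turns, each with a role and a list of typed content entries. The sketch below builds one user turn in the layout used by the examples in this section. The helper name build_user_message is hypothetical and is not part of Transformers.

```python
def build_user_message(text, image_urls=(), video_urls=()):
    """Build one user turn: media entries first, then the text prompt."""
    content = [{"type": "image", "url": url} for url in image_urls]
    content += [{"type": "video", "url": url} for url in video_urls]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# A single-turn conversation with one image and one question.
messages = [
    build_user_message(
        "Please describe the image explicitly.",
        image_urls=["http://images.cocodataset.org/val2017/000000039769.jpg"],
    )
]
```

The resulting list has the same shape as the messages variable in the examples that follow, so it can be handed to processor.apply_chat_template in the same way.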

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Please describe the image explicitly."},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
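The slice generate_ids[0, inputs["input_ids"].shape[1] :] in the example above drops the prompt tokens, because generate returns the prompt followed by the newly generated tokens. A minimal illustration with plain Python lists standing in for tensors; the token ids are made up and strip_prompt is a hypothetical helper.

```python
def strip_prompt(generated_ids, prompt_length):
    """Keep only the tokens produced after the prompt."""
    return generated_ids[prompt_length:]


prompt_ids = [101, 7, 42, 9]          # made-up ids standing in for inputs["input_ids"][0]
generated = prompt_ids + [55, 66, 2]  # the prompt is echoed, then new tokens are appended

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)  # [55, 66, 2]
```

Decoding only new_tokens is what keeps the prompt text out of decoded_output.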

Text-only generation

This example shows how to generate text with an InternVL model without providing any image input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "text", "text": "Write a haiku"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."

Advanced usage

Batched image and text input

InternVL models also support batched image and text input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]


>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
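Calling batch_decode on the full output returns each prompt together with its answer, as the strings above show. One way to keep only the reply is to split on the assistant marker. This is a sketch that assumes the decoded text contains the literal line "assistant" exactly as in the sample output; extract_reply is a hypothetical helper.

```python
def extract_reply(decoded, marker="\nassistant\n"):
    """Return the text after the last assistant marker, or the input unchanged."""
    _, found, reply = decoded.rpartition(marker)
    return reply if found else decoded


# First decoded string from the batch example above.
decoded_outputs = [
    "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
]
replies = [extract_reply(text) for text in decoded_outputs]
print(replies[0])
```

Slicing the token ids before decoding, as the single-image example does, reaches the same result without string matching.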

Batched multi-image input

This implementation of the InternVL models supports batched text-image input in which each text is paired with a different number of images.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
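Because each conversation in the batch can carry a different number of images, it can help to check the counts before calling the processor. A small sketch: count_media is a hypothetical helper, and the message layout mirrors the example above with the URLs replaced by placeholders.

```python
def count_media(conversation, media_type="image"):
    """Count content entries of the given type across all turns."""
    return sum(
        1
        for turn in conversation
        for item in turn["content"]
        if item["type"] == media_type
    )


# Two conversations: the first has one image, the second has two.
batch = [
    [{"role": "user", "content": [
        {"type": "image", "url": "https://example.com/view.jpg"},  # placeholder URL
        {"type": "text", "text": "Write a haiku for this image"},
    ]}],
    [{"role": "user", "content": [
        {"type": "image", "url": "https://example.com/statue.jpg"},  # placeholder URL
        {"type": "image", "url": "https://example.com/bridge.jpg"},  # placeholder URL
        {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
    ]}],
]

print([count_media(conv) for conv in batch])  # [1, 2]
```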

Video input

InternVL models can also handle video input. The following example runs inference on a video using the chat template.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "video",
...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
...             },
...             {"type": "text", "text": "What type of shot is the man performing?"},
...         ],
...     }
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     return_tensors="pt",
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
... ).to(model.device, dtype=torch.float16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
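A video is reduced to a limited number of frames before it is encoded; the batch output later in this document shows eight frame placeholders (Frame1 to Frame8) for this clip. The sketch below shows one common way to pick evenly spaced frame indices. It illustrates the idea only and is not claimed to be the sampling logic the processor uses; uniform_frame_indices is a hypothetical helper.

```python
def uniform_frame_indices(total_frames, num_frames=8):
    """Pick num_frames indices spread evenly across the clip (midpoint of each segment)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never ask for more frames than the clip contains.
    num_frames = min(num_frames, total_frames)
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


print(uniform_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
```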

Interleaved image and video input

This example shows how to use the chat template to process a batch of chat conversations that mix image and video input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
...                 {"type": "text", "text": "What type of shot is the man performing?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     padding=True,
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
... ).to(model.device, dtype=torch.bfloat16)

>>> outputs = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
 "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace."]

📄 License

This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.

📚 Documentation

Citation

If you find this project useful in your research, please consider citing:

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}

Information Table

Attribute	Details
Model type	Image-text-to-text model
Training data	OpenGVLab/MMPR-v1.2
Base model	OpenGVLab/InternVL3-2B-Instruct
Base model relation	Fine-tuned
Language	Multilingual
Tags	internvl