InternVL3-8B-hf

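Developed by OpenGVLab

InternVL3 is a series of advanced multimodal large language models with strong multimodal perception and reasoning capabilities, supporting image, video, and text inputs.

Image-Text-to-Text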

Transformers

Open-source license: Other #Multimodal large language model #Image-text-to-text #Video understanding

Downloads: 454

Release date: 4/18/2025

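Model Overview

InternVL3 is a multimodal large language model from OpenGVLab that shows strong overall performance. Compared with the previous generation, it has stronger multimodal perception and reasoning capabilities and extends its functionality to tool use, GUI agents, industrial image analysis, and 3D vision perception.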

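Model Features

Multimodal capabilities

Supports image, video, and text inputs, with strong multimodal perception and reasoning capabilities.

Extended capabilities

Beyond basic multimodal capabilities, supports extended capabilities such as tool use, GUI agents, industrial image analysis, and 3D vision perception.

Batch processing

Supports batched image and text inputs, which improves inference efficiency.

Native Transformers implementation

As a native Transformers model, supports core library features such as multiple attention implementations, including SDPA and FA2.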

Model Capabilities

Image captioning

Video content understanding

Multimodal dialogue

Text generation

Multilingual support

Batch inference

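Use Cases

Content understanding and generation

Image captioning

Generates a detailed description based on an input image

Produces natural-language descriptions that include details

Video analysis

Understands video content and answers questions about it

Identifies actions and scenes in a video

Creative content generation

Poetry writing

Generates poems from image or text prompts

Produces creative text that follows a theme

Industrial applications

Industrial image analysis

Analyzes images from industrial settings

Identifies specific objects and states in industrial scenes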

license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL3-8B-Instruct
base_model_relation: finetune
datasets:
  - OpenGVLab/MMPR-v1.2
language:
  - multilingual
tags:
  - internvl

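InternVL3-8B Transformers 🤗 Implementation

[📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]

[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

[!IMPORTANT] This repository contains the Hugging Face 🤗 Transformers implementation of the OpenGVLab/InternVL3-8B model. It is intended to be functionally equivalent to the original OpenGVLab release. As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and efficient batched inference with interleaved image, video, and text inputs.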
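
In Transformers, the attention backend is usually selected at load time through the `attn_implementation` argument of `from_pretrained`, which accepts values such as `"eager"`, `"sdpa"`, and `"flash_attention_2"`. The following is a minimal sketch of choosing a value. The helper name `pick_attn_implementation` is hypothetical (it is not part of Transformers), and the sketch assumes FA2 is only worth requesting when the optional `flash_attn` package is importable.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return a value for the `attn_implementation` argument of `from_pretrained`.

    Hypothetical helper: it only checks whether the optional `flash_attn`
    package can be imported and otherwise falls back to SDPA.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation()
print(attn)

# Assumed usage (not executed here because it would download the checkpoint):
# model = AutoModelForImageTextToText.from_pretrained(
#     "OpenGVLab/InternVL3-8B-hf",
#     attn_implementation=attn,
#     torch_dtype=torch.bfloat16,
# )
```

FA2 additionally needs a supported GPU and a half-precision dtype, so falling back to SDPA is the safer default in this sketch.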

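Introduction

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits stronger multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and more. We also compare InternVL3 with the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Benefiting from native multimodal pre-training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.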

You can find more information about the InternVL3 family in the original checkpoint, OpenGVLab/InternVL3-8B.

Usage Examples

Inference with Pipeline

Here is how to run inference with an InternVL3 model in a few lines of code using the image-text-to-text pipeline:

>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee.\n\n1. **Foreground Flowers**: \n   - The primary focus is a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly'

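Inference on a single image

This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.

[!NOTE] The model has been trained with a specific prompt format for chatting. Use processor.apply_chat_template(my_conversation_dict) to format your prompts correctly.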

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Please describe the image explicitly."},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
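
The chat template consumes a list of turns, each with a `role` and a list of typed `content` parts. A small helper can keep that structure consistent across examples. This is a minimal sketch: `build_user_message` is a hypothetical name introduced here for illustration, and it assumes images are referenced by URL and placed before the text, as in the example above.

```python
from __future__ import annotations

from typing import Any


def build_user_message(text: str, image_urls: list[str] | None = None) -> dict[str, Any]:
    """Build one user turn in the chat format used by the examples above.

    Image parts come first, followed by a single text part.
    """
    content: list[dict[str, str]] = [
        {"type": "image", "url": url} for url in (image_urls or [])
    ]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


messages = [
    build_user_message(
        "Please describe the image explicitly.",
        ["http://images.cocodataset.org/val2017/000000039769.jpg"],
    )
]
print(messages[0]["content"][0]["type"])  # image
```

The resulting `messages` list has the same shape as the hand-written one and could be passed to `processor.apply_chat_template` in the same way.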

Text-only generation

This example shows how to generate text using the InternVL model without providing any image input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "text", "text": "Write a haiku"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."

Batched image and text inputs

InternVL models also support batched image and text inputs.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,\nWooden pier,\nNature's peace.",
 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese-style arch".']
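
Because `batch_decode` is applied to the full generated sequences here, each decoded string still contains the prompt (`user ... assistant`) before the reply. One option is to slice the token ids at `inputs["input_ids"].shape[1]` before decoding, as the single-image example does. The sketch below shows a text-level alternative instead. It assumes the `assistant\n` marker seen in the outputs above, and `extract_reply` is a hypothetical helper name; the sample strings are shortened stand-ins for real model output.

```python
def extract_reply(decoded: str, marker: str = "assistant\n") -> str:
    """Return the text after the last assistant marker, or the input unchanged."""
    _, found, reply = decoded.rpartition(marker)
    return reply if found else decoded


# Shortened sample strings in the same shape as the decoded outputs above.
decoded_outputs = [
    "user\n\nWrite a haiku for this image\nassistant\nSilky lake,\nWooden pier,\nNature's peace.",
    "user\n\nDescribe this image\nassistant\nThe image shows a street scene.",
]
replies = [extract_reply(text) for text in decoded_outputs]
print(replies[0])
```

Splitting on text is fragile if a reply itself contains the marker, so slicing token ids is likely the more robust choice when the input tensors are still available.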

Batched multi-image input

This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake,\nWooden pier,\nNature's peace.",
 'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
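
When conversations in a batch carry different numbers of images, it can help to check the counts before calling the processor. The following is a minimal sketch under that assumption; `count_media` is a hypothetical helper, not a Transformers function, and it only inspects the message dictionaries.

```python
def count_media(conversation: list[dict], media_type: str = "image") -> int:
    """Count content parts of the given type across all turns of one conversation."""
    return sum(
        1
        for turn in conversation
        for part in turn["content"]
        if part.get("type") == media_type
    )


messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "Write a haiku for this image"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
                {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
            ],
        },
    ],
]

counts = [count_media(conversation) for conversation in messages]
print(counts)  # [1, 2]
```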

Video input

InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "video",
...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
...             },
...             {"type": "text", "text": "What type of shot is the man performing?"},
...         ],
...     }
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     return_tensors="pt",
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
... ).to(model.device, dtype=torch.float16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
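
The video example loads the model in 4-bit through `BitsAndBytesConfig`, while the other examples use bfloat16. A rough back-of-the-envelope estimate shows why that matters for memory. This sketch assumes a nominal 8 billion parameters (taken from the "8B" in the model name, not an exact count) and counts weights only, ignoring quantization overhead, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # assumption: nominal size implied by the "8B" model name

for label, bits in [("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
# bfloat16: ~14.9 GiB
# 4-bit: ~3.7 GiB
```

Actual usage will be higher than these figures, but the ratio gives a sense of how much headroom 4-bit loading can free up on a single GPU.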

Interleaved image and video inputs

This example shows how to handle a batch of chat conversations with interleaved image and video inputs using the chat template.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
...                 {"type": "text", "text": "What type of shot is the man performing?"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     padding=True,
...     add_generation_prompt=True,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
... ).to(model.device, dtype=torch.bfloat16)

>>> outputs = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThese images depict the Statue of Liberty and the Golden Gate Bridge.',
 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
 "user\n\nWrite a haiku for this image\nassistant\nSilky lake,\nWooden pier,\nNature's peace."]
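
The decoded prompt for the video conversation lists eight frame placeholders (Frame1 to Frame8), which suggests the clip is reduced to a fixed number of frames before it reaches the model. The sketch below illustrates one common way to pick such frames, uniform sampling over the clip. It is an assumption for illustration only and not the processor's actual sampling code; `uniform_frame_indices` and its `num_frames` default are hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the frame at the centre of each of the `num_frames` equal segments.
    return [int(step * i + step / 2) for i in range(num_frames)]


print(uniform_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
```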

License

This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.

Citation

If you find this project useful in your research, please consider citing:

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}