🚀 InternVL3-78B Transformers 🤗 Implementation
InternVL3-78B is part of an advanced series of multimodal large language models (MLLMs) with outstanding overall performance. This repository contains the Hugging Face 🤗 Transformers implementation of the OpenGVLab/InternVL3-78B model.
🚀 Quick Start
This repository contains the Hugging Face 🤗 Transformers implementation of the OpenGVLab/InternVL3-78B model. It is intended to be functionally equivalent to the original OpenGVLab release. As a native Transformers model, it supports core library features such as the various attention implementations (including eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.
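For example, a particular attention backend can be selected at load time through the attn_implementation argument of from_pretrained. The snippet below is a minimal sketch rather than part of the official quick start; the dtype and device_map values are illustrative, and "flash_attention_2" additionally requires the flash-attn package to be installed.
>>> from transformers import AutoModelForImageTextToText
>>> import torch
>>> # Illustrative configuration: pick one of "eager", "sdpa", or "flash_attention_2".
>>> model = AutoModelForImageTextToText.from_pretrained(
...     "OpenGVLab/InternVL3-78B-hf",
...     torch_dtype=torch.bfloat16,
...     device_map="auto",
...     attn_implementation="sdpa",
... )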
[📜 InternVL 1.0] [📜 InternVL 1.5] [📜 InternVL 2.5] [📜 InternVL2.5-MPO] [📜 InternVL3]
[🆕 Blog] [🗨️ Chat Demo] [🤗 HF Demo] [🚀 Quick Start] [📖 Documents]

✨ Key Features
InternVL3 is an advanced series of multimodal large language models (MLLMs) with strong overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, and further extends its multimodal abilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and more. InternVL3 is also compared against the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Benefiting from native multimodal pre-training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.
More details about the InternVL3 family can be found at the original checkpoint OpenGVLab/InternVL3-78B.
💻 Usage Examples
Basic Usage
Inference with the pipeline
The example below shows how to run inference on the InternVL3 model using the image-text-to-text pipeline.
>>> from transformers import pipeline
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-78B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
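For a model of this size, you will typically want the pipeline to shard the weights across available devices and use a reduced-precision dtype. A minimal sketch, with illustrative device_map and torch_dtype choices:
>>> import torch
>>> pipe = pipeline(
...     "image-text-to-text",
...     model="OpenGVLab/InternVL3-78B-hf",
...     device_map="auto",  # spread the weights over available GPUs
...     torch_dtype=torch.bfloat16,
... )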
Inference with a single image
This example demonstrates how to run inference on a single image with the InternVL models using the chat template.
⚠️ Important note
The model has been trained for chat with a specific prompt format. Use processor.apply_chat_template(my_conversation_dict) to format your prompts correctly.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Please describe the image explicitly."},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
Text-only generation
This example shows how to generate text with an InternVL model without providing any image input.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "text", "text": "Write a haiku"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> print(decoded_output)
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
Advanced Usage
Batched image and text inputs
The InternVL models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
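Note that batch_decode above decodes the full sequences, prompts included, which is why the outputs start with the user turn. As in the single-image example, the prompt tokens can be sliced off before decoding; a minimal sketch:
>>> # Keep only the newly generated tokens for each sequence in the batch.
>>> generated = output[:, inputs["input_ids"].shape[1] :]
>>> processor.batch_decode(generated, skip_special_tokens=True)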
Batched multi-image inputs
This implementation of the InternVL models supports batched text-image inputs in which each text is paired with a different number of images.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
Video input
The InternVL models can also handle video inputs. The following example shows how to run inference on a video input using the chat template.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "video",
... "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
... },
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... return_tensors="pt",
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
... ).to(model.device, dtype=torch.float16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
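The example above loads the model with 4-bit quantization using the default compute dtype. BitsAndBytesConfig also accepts an explicit compute dtype; the following sketch is an illustrative variant, not a required configuration, and with it the inputs above would be cast to torch.bfloat16 instead of torch.float16.
>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.bfloat16,  # weights stay 4-bit, matmuls run in bf16
... )
>>> model = AutoModelForImageTextToText.from_pretrained(
...     model_checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto",
... )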
Interleaved image and video inputs
This example shows how to process a batch of chat conversations with interleaved image and video inputs using the chat template.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-78B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... padding=True,
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... ).to(model.device, dtype=torch.bfloat16)
>>> outputs = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
"user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
📄 License
This project is released under the MIT License. It uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.
Citation
If you find this project useful in your research, please consider citing:
@article{chen2024expanding,
title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
journal={arXiv preprint arXiv:2412.05271},
year={2024}
}
@article{wang2024mpo,
title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2411.10442},
year={2024}
}
@article{chen2024far,
title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
journal={arXiv preprint arXiv:2404.16821},
year={2024}
}
@inproceedings{chen2024internvl,
title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={24185--24198},
year={2024}
}








