InternViT-300M open-source vision model - supports multiple vision tasks and is free to use

Vit Intern300m Patch14 448.ogvl Dist

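Released by timm

InternViT-300M is a vision Transformer model developed by the OpenGVLab team. It is a pretrained model distilled from InternViT-6B and supports a range of vision tasks.

Image classification

Transformers

Open-source license: MIT #multimodal visual features #high resolution 448px #OCR enhanced

Downloads: 147

Release date: 10/16/2024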

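Model overview

This is an image feature extraction model based on the ViT architecture. It is used mainly for image classification and feature extraction, and accepts 448x448 image input.

Model characteristics

High-resolution support

Accepts 448x448 high-resolution image input, which suits tasks that need fine-grained visual features.

Pretraining on multiple datasets

Pretrained on large-scale datasets including LAION-en/zh, COYO, and GRIT, which supports strong generalization.

Distilled model

Distilled from the larger InternViT-6B model, which reduces model size while aiming to preserve performance.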

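Model capabilities

Image classification

Visual feature extraction

Image embedding generation

Use cases

Computer vision

Image classification

Classifies an input image by identifying the main objects or scene in it.

Performs well on multiple benchmark datasets

Visual feature extraction

Extracts deep visual features from an image for downstream tasks such as object detection and image retrieval.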
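
As a rough illustration of the image-retrieval use case, the sketch below ranks a small gallery by cosine similarity to a query embedding. The vectors are made-up 4-dimensional placeholders, not output from this model, and `cosine_similarity` and `rank_gallery` are hypothetical helper names that are not part of timm.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder embeddings (assumption: toy values, not model output).
query = [1.0, 0.0, 1.0, 0.0]
gallery = {
    "img_a": [0.9, 0.1, 1.1, 0.0],  # nearly parallel to the query
    "img_b": [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    "img_c": [1.0, 1.0, 1.0, 1.0],  # partially aligned with the query
}

ranking = rank_gallery(query, gallery)
print(ranking)
```

In practice the placeholder vectors would be replaced by embeddings computed for each image, and for large galleries a vectorized or approximate nearest-neighbor search would likely be preferable to this loop.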

🚀 vit_intern300m_patch14_448.ogvl_dist

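This is an InternViT image feature model. It was pretrained by the paper authors on a variety of image-text data, using distillation from InternViT-6B. The model weights were converted from OpenGVLab/InternViT-300M-448px to the timm vit format. Note that this vit has no final norm before the features/head.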

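🚀 Quick start

This model can be used for image classification and feature extraction. The sections below describe how to use it.

✨ Key features

Image classification
Image feature extraction
Image embedding generation

📚 Documentation

Model details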

Attribute	Details
Model type	Image classification / feature backbone
Model stats	Params (M): 304.0; GMACs: 362.0; Activations (M): 656.4; Image size: 448 x 448
Papers	InternVL2: Better than the Best: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/; InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks: https://arxiv.org/abs/2312.14238
Original	https://github.com/OpenGVLab/InternVL
Datasets	LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, other-OCR
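
For a rough sense of what 304.0 M parameters means in memory, the following back-of-the-envelope sketch estimates the size of the weights alone at common precisions. It assumes every parameter is stored at the stated width and ignores activations, buffers, and framework overhead; `weight_memory_gb` is a hypothetical helper.

```python
PARAMS_MILLIONS = 304.0  # parameter count from the model statistics above

def weight_memory_gb(params_millions, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9

# Assumed storage widths: 4 bytes for float32, 2 bytes for float16/bfloat16.
for name, width in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS_MILLIONS, width):.2f} GB")
```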

💻 Usage examples

Basic usage

Image classification

from urllib.request import urlopen
from PIL import Image
import timm
import torch  # required for torch.topk below

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_intern300m_patch14_448.ogvl_dist', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
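
The final line of the snippet above turns the raw outputs into percentages with a softmax and keeps the five largest. The plain-Python sketch below mirrors that step on a toy list of scores so the arithmetic is visible. The scores are invented, and `softmax_percent` and `top_k` are hypothetical helpers, not timm or PyTorch functions.

```python
import math

def softmax_percent(scores):
    """Softmax over a list of scores, scaled to percentages."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [100.0 * e / total for e in exps]

def top_k(values, k):
    """Return (values, indices) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order

scores = [2.0, 0.5, 3.0, -1.0, 1.0, 0.0]  # toy scores (assumption)
probs = softmax_percent(scores)
top5_probabilities, top5_class_indices = top_k(probs, k=5)
print(top5_class_indices)
```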

Feature map extraction

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 1024, 32, 32])
    #  torch.Size([1, 1024, 32, 32])
    #  torch.Size([1, 1024, 32, 32])

    print(o.shape)

Image embeddings

from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1025, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
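
The shapes quoted in the comments above follow from the input and patch sizes in the model name. Here is a small sketch of that arithmetic, assuming square non-overlapping patches and a single class token prepended to the patch tokens (an assumption consistent with the 1025-token output shown above). `vit_token_layout` is a hypothetical helper.

```python
def vit_token_layout(image_size, patch_size, num_prefix_tokens=1):
    """Return (grid_side, num_patches, sequence_length) for a square ViT input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid_side = image_size // patch_size
    num_patches = grid_side * grid_side
    return grid_side, num_patches, num_patches + num_prefix_tokens

grid, patches, seq_len = vit_token_layout(image_size=448, patch_size=14)
print(grid, patches, seq_len)  # 32 1024 1025
```

The 32 x 32 grid matches the spatial size of the feature maps, and 1025 matches the token dimension of the unpooled output.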

📄 License

This model is provided under the MIT license.

Citation

@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
