MambaVision - L3 - 256 - 21K open-source vision model: a hybrid design that improves visual feature modeling and long-range spatial modeling

Mambavision L3 256 21K

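Developed by NVIDIA

The first hybrid computer vision model to combine the strengths of Mamba and Transformers. It redesigns the Mamba formulation to model visual features more efficiently, and adds self-attention modules to the final layers of the Mamba architecture to strengthen modeling of long-range spatial dependencies.

Image Classification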

Transformers

Open-source license: Other #Hybrid Mamba-Transformer #Long-range spatial modeling #High-accuracy image classification

Downloads: 510

Release date: 3/24/2025

Model Overview

MambaVision is a hybrid Mamba-Transformer vision backbone designed for image classification and feature extraction. It is pretrained on the ImageNet-21K dataset and fine-tuned on ImageNet-1K.

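Model Features

Hybrid architecture

Combines Mamba's efficient sequence modeling with the Transformer's ability to capture long-range dependencies to improve visual feature extraction.

Hierarchical structure

Uses a hierarchical design that supports multi-stage feature extraction and meets the needs of a range of vision tasks.

Performance optimization

Achieves a new state-of-the-art (SOTA) Pareto front in terms of Top-1 accuracy and throughput.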

Model Capabilities

Image classification

Visual feature extraction

Multi-stage feature map output

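Use Cases

Computer vision

Image classification

Classifies an input image and identifies the main object in it.

Achieves 87.3% Top-1 accuracy on ImageNet-1K.

Feature extraction

Extracts multi-stage feature maps from an image for use in downstream vision tasks.

Outputs feature maps from 4 stages, which suits visual analysis at different granularities.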

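🚀 MambaVision: A Hybrid Mamba-Transformer Vision Backbone

This is a hybrid model for computer vision that draws on the strengths of both Mamba and Transformers. It models visual features more efficiently and is better at capturing long-range spatial dependencies.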

🔍 Model Information

Attribute	Details
Dataset	ILSVRC/imagenet-21k
License	other
License name	nvclv1
License link	LICENSE
Pipeline tag	image-classification
Library name	transformers

📚 Paper

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

💻 Code Repository

https://github.com/NVlabs/MambaVision

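🚀 Quick Start

✨ Key Features

We developed the first hybrid model for computer vision that leverages the strengths of both Mamba and Transformers. Specifically, we redesigned the Mamba formulation to improve its ability to model visual features efficiently. We also conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. The results show that equipping the Mamba architecture with several self-attention blocks in its final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce the MambaVision model family, which uses a hierarchical architecture to meet a variety of design criteria.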
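To make the "self-attention blocks in the final layers" idea concrete, here is a minimal sketch of how the blocks of a single stage could be ordered. The function name `stage_layout`, the block labels, and the depth and attention counts are all hypothetical and chosen only for illustration; they are not the actual configuration of MambaVision-L3.

```python
from typing import List


def stage_layout(depth: int, num_attention: int) -> List[str]:
    """Return block types for one stage: Mamba-style mixers first, self-attention last."""
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


# Hypothetical stage with 8 blocks, the last 4 of which are self-attention.
print(stage_layout(8, 4))
```

The point of the ordering is that the attention blocks see features already processed by the Mamba-style mixers, so global interactions are added at the end of the stage rather than throughout.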

📦 Installation

To install the packages required to use MambaVision, run the following command.

pip install mambavision
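After installing, a quick check like the one below can confirm that the modules imported by the usage examples in this document are available. This is a sketch using only the Python standard library; the helper name `missing_packages` is made up here, and the list of names is taken from the import statements in the examples (`PIL` is provided by the Pillow package). Whether `pip install mambavision` pulls in all of them is not stated in this document, so treat any missing entry as something to install separately.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


# Top-level modules imported by the usage examples in this document.
required = ["mambavision", "transformers", "timm", "PIL", "requests"]
print("Missing packages:", missing_packages(required))
```

The examples also call `.cuda()`, so they assume a CUDA-capable GPU is present.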

💻 Usage Examples

Basic usage

Image classification

The following code shows how to classify an image with MambaVision.

from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-L3-256-21K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 256, 256)  # MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()
# model inference
outputs = model(inputs)
logits = outputs['logits'] 
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
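The example above prints only the single best class. When the runner-up classes matter, a small helper can turn the raw logits into the top-k labels with softmax probabilities. The sketch below uses only the standard library and works on a plain list of floats; the function name `top_k` and the toy labels are illustrative and not part of MambaVision or Transformers.

```python
import math


def top_k(logits, id2label, k=5):
    """Return the k highest-scoring (label, probability) pairs from raw logits."""
    peak = max(logits)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(id2label[i], exps[i] / total) for i in ranked]


# Toy logits and labels, made up for illustration.
demo_labels = {0: "cat", 1: "dog", 2: "bird"}
for label, prob in top_k([2.0, 0.5, 1.0], demo_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model from the example above, the call would look like `top_k(logits[0].tolist(), model.config.id2label)`, assuming `logits` has shape `[1, num_classes]` as the `argmax(-1).item()` call implies.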

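Feature extraction

MambaVision can also be used as a general-purpose feature extractor. The following code extracts the output of each stage along with the final average-pooled features.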

from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModel.from_pretrained("nvidia/MambaVision-L3-256-21K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 256, 256)  # MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 1568])
print("Number of stages in extracted features:", len(features)) # 4 stages
print("Size of extracted features in stage 1:", features[0].size()) # torch.Size([1, 196, 128, 128])
print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 1568, 16, 16])
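One common downstream use of the average-pooled features is comparing two images by the cosine similarity of their feature vectors. The following is a minimal sketch using only the standard library; the helper name `cosine_similarity` and the toy vectors are assumptions made for illustration and do not come from the MambaVision code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the pooled features of two images.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With real features, each vector could be obtained as `out_avg_pool[0].tolist()` from a separate forward pass per image. For large batches, the equivalent tensor operation in PyTorch would be more efficient than this list-based version.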