MambaVision-L2-512-21K: an open-source computer vision model that combines the strengths of Mamba and Transformer to improve visual feature modeling

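MambaVision L2 512 21K

Developed by NVIDIA

The first hybrid computer vision model to combine the strengths of Mamba and Transformer architectures. It redesigns the Mamba formulation to strengthen visual feature modeling.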

Image Classification

Transformers

License: Other #Hybrid Mamba-Transformer #High-resolution image classification #Long-range spatial modeling

Downloads: 2,678

Release date: 3/24/2025

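Model Overview

MambaVision is a hybrid computer vision model that combines the strengths of the Mamba and Transformer architectures and is optimized for visual feature modeling. This model was pretrained on ImageNet-21K and fine-tuned on ImageNet-1K at 512×512 resolution, where it achieves strong image classification performance.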

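Model Features

Hybrid architecture

The first model to combine the strengths of the Mamba and Transformer architectures, with a redesigned Mamba formulation that strengthens visual feature modeling

Hierarchical architecture design

Uses a hierarchical design and adds self-attention modules to the final layers of the Mamba architecture, which substantially improves modeling of long-range spatial dependencies

High performance

Sets a new SOTA Pareto frontier for Top-1 accuracy versus throughput, reaching 87.3% Top-1 accuracy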

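Model Capabilities

Image classification

Visual feature extraction

Use Cases

Computer vision

General-purpose image classification

Classifies an input image and identifies the main objects or scene in it

Achieves 87.3% Top-1 accuracy on ImageNet-1K

Visual feature extraction

Serves as a general-purpose feature extractor, returning feature maps from four stages plus the final average-pooled features

Provides feature representations at different levels, which can be adapted to downstream vision tasks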

🚀 MambaVision-L2-512-21K

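MambaVision is a hybrid model for computer vision: an image classification model that draws on the strengths of both Mamba and Transformer.

🚀 Quick Start

This section covers the MambaVision-L2-512-21K model overview, performance, usage, and license.

✨ Key Features

Hybrid model: We developed the first hybrid model for computer vision that combines the strengths of Mamba and Transformer.
Improved Mamba: We redesigned the Mamba formulation to improve its ability to model visual features efficiently.
Integration with ViT: We conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
Capturing long-range dependencies: Equipping the final layers of the Mamba architecture with several self-attention blocks substantially improves the capacity to capture long-range spatial dependencies.
Hierarchical architecture: We introduce a family of MambaVision models with a hierarchical architecture to meet a range of design criteria.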

📦 Installation

To install the packages required by MambaVision, run the following command.

pip install mambavision
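
After installation, it can be useful to confirm that the modules used by the examples in this document are importable. The following is a minimal sketch using only the Python standard library. The module list is taken from the import statements in the usage examples plus the `mambavision` package installed above; the helper name `missing_packages` is hypothetical and not part of any library.

```python
from importlib.util import find_spec

# Top-level modules used by the usage examples, plus the installed package itself.
REQUIRED = ["mambavision", "transformers", "timm", "PIL", "requests"]

def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This check only looks up each module on the import path; it does not verify versions or GPU support.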

💻 Usage Examples

Basic usage

Image classification

The following code shows an example of image classification with MambaVision.

from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-L2-512-21K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 512, 512)  # (channels, height, width); MambaVision accepts other input resolutions as well

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()
# model inference
outputs = model(inputs)
logits = outputs['logits'] 
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
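
The example above prints only the single most likely class. As an illustration of how the logits could be turned into a ranked list with probabilities, here is a minimal sketch in plain Python. The helper name `top_k_predictions` is hypothetical, and the scores and labels in the toy example are made up rather than real model output.

```python
import math

def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs.

    logits   : list of floats, one score per class
    id2label : mapping from class index to a readable label
    """
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]

# Toy example with made-up scores and labels (not real model output).
toy_logits = [0.5, 2.0, -1.0, 1.0]
toy_labels = {0: "cat", 1: "dog", 2: "car", 3: "bird"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the classification example, a call such as `top_k_predictions(logits[0].tolist(), model.config.id2label)` should work, assuming `id2label` is keyed by integer class index.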

Feature extraction

MambaVision can also be used as a general-purpose feature extractor. The following code shows an example of feature extraction.

from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModel.from_pretrained("nvidia/MambaVision-L2-512-21K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 512, 512)  # (channels, height, width); MambaVision accepts other input resolutions as well

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 1568])
print("Number of stages in extracted features:", len(features)) # 4 stages
print("Size of extracted features in stage 1:", features[0].size()) # torch.Size([1, 196, 128, 128])
print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 1568, 16, 16])
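
The sizes in the comments above follow a regular pattern that can be sketched as shape arithmetic. The function below is an illustration under stated assumptions: stage i has `base_dim * 2**i` channels and a total stride of `4 * 2**i`. The stage 1 and stage 4 values agree with the comments in the example; the values for stages 2 and 3 are extrapolated from that pattern and are not confirmed by this document. The name `expected_stage_shapes` is hypothetical.

```python
def expected_stage_shapes(height, width, base_dim=196, num_stages=4):
    """Return a (channels, height, width) tuple per stage.

    Assumes stage i has base_dim * 2**i channels and a total stride of
    4 * 2**i, i.e. strides 4, 8, 16, 32 for four stages.
    """
    shapes = []
    for i in range(num_stages):
        stride = 4 * 2 ** i
        shapes.append((base_dim * 2 ** i, height // stride, width // stride))
    return shapes

for stage, shape in enumerate(expected_stage_shapes(512, 512), start=1):
    print(f"stage {stage}: {shape}")
```

For input sizes that are not multiples of 32, the integer division here is only an approximation of what the model produces.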

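📚 Documentation

Model overview

We developed the first hybrid model for computer vision that draws on the strengths of both Mamba and Transformer. Specifically, we redesigned the Mamba formulation to improve its ability to model visual features efficiently. We also conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. The results show that equipping the final layers of the Mamba architecture with several self-attention blocks substantially improves the capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet a range of design criteria.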
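
The idea of placing self-attention blocks in the final layers can be pictured as a simple block ordering within a stage. The sketch below only illustrates that ordering; the function name `stage_layout` and the `depth` and `num_attention` values are hypothetical and do not reflect the actual configuration of this checkpoint.

```python
def stage_layout(depth, num_attention):
    """List block types for one stage: Mamba-style mixers first, attention last."""
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention

# Illustrative values only: an 8-block stage whose last 4 blocks use self-attention.
print(stage_layout(depth=8, num_attention=4))
```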