MambaVision: A Hybrid Mamba-Transformer Vision Backbone
We've developed the first hybrid model for computer vision that combines Mamba and Transformers, enhancing visual feature modeling and long-range spatial dependency capture.
Quick Start
The MambaVision model is designed for image classification tasks. You can quickly start using the model by following the steps below.
Features
- Hybrid Design: The first hybrid model for computer vision, leveraging the strengths of Mamba and Transformers.
- Enhanced Mamba Formulation: Redesigned to better model visual features.
- Ablation Study: Comprehensive research on integrating Vision Transformers (ViT) with Mamba.
- Hierarchical Architecture: A family of models with hierarchical structures to meet different design needs.
Installation
It is highly recommended to install the requirements for MambaVision by running the following:
pip install mambavision
Usage Examples
Basic Usage
The following shows how to use MambaVision for image classification and feature extraction.
Image Classification
from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

# Load the pretrained classifier and move it to the GPU in eval mode.
model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-B-21K", trust_remote_code=True)
model.cuda().eval()

# Fetch a sample image from the COCO validation set.
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Build the eval transform from the model's own preprocessing config.
input_resolution = (3, 224, 224)
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

# Run inference and map the top logit to its class label.
inputs = transform(image).unsqueeze(0).cuda()
outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
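If you want class probabilities rather than just the arg-max index, the logits can be passed through a softmax and ranked. Here is a minimal, framework-free sketch of that post-processing (it uses random stand-in logits so it runs without a GPU; on the real model output you would apply the same idea to `logits`):

```python
import math
import random

def topk_probs(logits, k=5):
    """Softmax a list of logits and return the top-k (index, probability) pairs."""
    m = max(logits)                                # stabilize the exponentials
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Stand-in for model logits over 1000 ImageNet-1K classes.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(1000)]
for idx, p in topk_probs(logits, k=5):
    print(f"class {idx}: {p:.4f}")
```

The indices returned by `topk_probs` can then be looked up in `model.config.id2label`, just as the single arg-max index is above.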
Feature Extraction
from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

# Load the backbone (no classification head) in eval mode on the GPU.
model = AutoModel.from_pretrained("nvidia/MambaVision-B-21K", trust_remote_code=True)
model.cuda().eval()

# Fetch a sample image from the COCO validation set.
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Build the eval transform from the model's own preprocessing config.
input_resolution = (3, 224, 224)
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

# The backbone returns averaged pooled features plus per-stage feature maps.
inputs = transform(image).unsqueeze(0).cuda()
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())
print("Number of stages in extracted features:", len(features))
print("Size of extracted features in stage 1:", features[0].size())
print("Size of extracted features in stage 4:", features[3].size())
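For a four-stage hierarchical backbone like MambaVision, each stage typically halves the spatial resolution, starting from a stride-4 stem. Under that assumption (the stem stride and per-stage downsampling here are illustrative; check `features[i].size()` for the actual values), the expected spatial side lengths for a 224x224 input can be computed as:

```python
def stage_resolutions(input_size=224, stem_stride=4, num_stages=4):
    """Spatial side length of each stage's feature map, assuming a
    stride-4 stem and 2x downsampling between consecutive stages."""
    sizes = []
    stride = stem_stride
    for _ in range(num_stages):
        sizes.append(input_size // stride)
        stride *= 2
    return sizes

print(stage_resolutions(224))  # side lengths for stages 1..4
```

So stage 1 is the highest-resolution (and widest spatial) feature map, and stage 4 the coarsest, which is the usual input layout for detection and segmentation heads.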
Documentation
Model Overview
We have developed the first hybrid model for computer vision which leverages the strengths of Mamba and Transformers. Specifically, our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
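To make that layout concrete, here is a schematic sketch, not the official implementation, of how a MambaVision-style stage might interleave mixer types: Mamba-style mixers in the early blocks and self-attention only in the final blocks of the stage. The block counts and names below are illustrative assumptions:

```python
def stage_layout(depth, num_attention_blocks):
    """Return the mixer type of each block in a stage: Mamba-style mixers
    first, self-attention reserved for the last blocks (illustrative)."""
    cutoff = depth - num_attention_blocks
    return ["mamba" if i < cutoff else "self-attention" for i in range(depth)]

# A hypothetical 10-block stage whose final half uses self-attention.
print(stage_layout(depth=10, num_attention_blocks=5))
```

The intuition from the ablation study is that the early Mamba blocks handle efficient sequence modeling, while the trailing self-attention blocks recover long-range spatial dependencies.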
Model Performance
MambaVision-B-21K is pretrained on the ImageNet-21K dataset and finetuned on ImageNet-1K.
| Property | Details |
|----------|---------|
| Model Type | MambaVision-B-21K |
| Training Data | Pretrained on ImageNet-21K, finetuned on ImageNet-1K |

| Name | Acc@1(%) | Acc@5(%) | #Params(M) | FLOPs(G) | Resolution |
|------|----------|----------|------------|----------|------------|
| MambaVision-B-21K | 84.9 | 97.5 | 97.7 | 15.0 | 224x224 |
In addition, the MambaVision models demonstrate strong performance by achieving a new SOTA Pareto front in terms of Top-1 accuracy and throughput.
License
This model is released under the NVIDIA Source Code License - NC.