MambaVision-L3-256-21K Open-Source Vision Model: Combining Mamba and Transformer Strengths for Visual Feature and Long-Range Spatial Modeling

MambaVision-L3-256-21K

Developed by NVIDIA

The first hybrid computer vision model to combine the strengths of Mamba and Transformers. It redesigns the Mamba formulation to model visual features more efficiently, and adds self-attention blocks in the final layers of the Mamba architecture to better capture long-range spatial dependencies.

Image Classification

Transformers

Open Source License: Other #Hybrid Mamba-Transformer #Long-range spatial modeling #High-precision image classification

Downloads 510

Release Time: 3/24/2025

Model Overview

MambaVision is a hybrid Mamba-Transformer vision backbone network, designed for image classification and feature extraction, pre-trained on the ImageNet-21K dataset and fine-tuned on ImageNet-1K.

Model Features

Hybrid architecture

Combines Mamba's efficient sequence modeling with Transformer's long-range dependency capture capabilities to optimize visual feature extraction.

Hierarchical structure

Adopts a hierarchical design to meet the needs of diverse visual tasks, supporting multi-stage feature extraction.

Performance optimization

Reaches a new state-of-the-art (SOTA) Pareto front in the trade-off between Top-1 accuracy and throughput.

Model Capabilities

Image classification

Visual feature extraction

Multi-stage feature map output

Use Cases

Computer vision

Image classification

Classifies input images to identify the main objects in the image.

Achieves 87.3% Top-1 accuracy on ImageNet-1K.

Feature extraction

Extracts multi-stage feature maps from images for downstream visual tasks.

Supports output of feature maps at 4 stages, suitable for visual analysis at different granularities.

🚀 MambaVision: A Hybrid Mamba-Transformer Vision Backbone

We have developed the first hybrid model for computer vision that combines the strengths of Mamba and Transformers, offering high-performance image classification capabilities.

📚 Documentation

MambaVision: A Hybrid Mamba-Transformer Vision Backbone.

Code: https://github.com/NVlabs/MambaVision

🔍 Model Overview

We have developed the first hybrid model for computer vision that leverages the strengths of Mamba and Transformers. Specifically, our core contribution is a redesign of the Mamba formulation that enhances its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
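As a rough illustration of the layout described above (Mamba-style mixer blocks first, self-attention blocks in the final layers), the sketch below builds a list of block types for one stage. The function name, the string labels, and the `depth` and `num_attention` values are hypothetical and are not taken from the MambaVision code or the L3 configuration.

```python
def hybrid_stage_layout(depth: int, num_attention: int) -> list[str]:
    """Return block types for one stage: Mamba-style mixers first, self-attention last."""
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    # The last `num_attention` positions hold self-attention blocks.
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


# Illustrative values only; not the real MambaVision-L3 depths.
print(hybrid_stage_layout(depth=8, num_attention=4))
```

Placing attention last reflects the finding quoted above that self-attention at the final layers helps capture long-range spatial dependencies.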

📊 Model Performance

MambaVision-L3-256-21K is pretrained on the ImageNet-21K dataset and finetuned on ImageNet-1K. Both pretraining and finetuning are performed at 256 x 256 resolution.

Name	Acc@1 (%)	Acc@5 (%)	#Params (M)	FLOPs (G)	Resolution
MambaVision-L3-256-21K	87.3	98.3	739.6	122.3	256x256

In addition, the MambaVision models demonstrate strong performance by achieving a new SOTA Pareto front in terms of Top-1 accuracy and throughput.
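The parameter count in the table above gives a rough sense of how much memory the weights alone occupy. The sketch below is a back-of-the-envelope estimate, assuming every parameter is stored at the given precision. It ignores activations, buffers, and framework overhead, and the helper name is hypothetical.

```python
PARAMS_MILLIONS = 739.6  # "#Params (M)" from the table above


def weight_memory_gib(params_millions: float, bytes_per_param: int) -> float:
    """Estimate weight storage in GiB for a given per-parameter byte width."""
    return params_millions * 1e6 * bytes_per_param / 2**30


# Assumed precisions; the precision actually used at inference is up to the user.
for name, nbytes in (("fp32", 4), ("fp16/bf16", 2)):
    print(f"{name}: {weight_memory_gib(PARAMS_MILLIONS, nbytes):.2f} GiB")
```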

📦 Installation

We highly recommend installing the requirements for MambaVision by running:

pip install mambavision
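To confirm that the installation worked before running the examples, a quick check such as the following may help. It assumes the import names match the package names (`mambavision`, plus `transformers` and `timm`, which the usage snippets import). The helper `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


# Package names are assumed to equal their import names.
for pkg in ("mambavision", "transformers", "timm"):
    print(f"{pkg}: {'found' if is_installed(pkg) else 'missing'}")
```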

💻 Usage Examples

For each model, we offer two variants, one for image classification and one for feature extraction, each of which can be imported with one line of code.

🔍 Basic Usage - Image Classification

In the following example, we demonstrate how MambaVision can be used for image classification.

Given an image from the COCO validation set as input (its URL appears in the snippet below):

The following snippet can be used for image classification:

from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-L3-256-21K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 256, 256)  # MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()
# model inference
outputs = model(inputs)
logits = outputs['logits'] 
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

The predicted label is brown bear, bruin, Ursus arctos.
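The snippet above reports only the single best class. To inspect several candidates with their probabilities, a small helper along these lines could be applied to the logits. This is a minimal sketch in plain Python; `top_k_labels` is a hypothetical name, and the toy logits and labels below are placeholders rather than model output.

```python
import math


def top_k_labels(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs from raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(id2label[i], exps[i] / total) for i in ranked]


# Placeholder data for illustration. With the model above, one would pass
# logits[0].tolist() and model.config.id2label instead.
toy_logits = [2.0, 0.5, 1.0]
toy_labels = {0: "class_a", 1: "class_b", 2: "class_c"}
for label, prob in top_k_labels(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```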

🔍 Advanced Usage - Feature Extraction

MambaVision can also be used as a generic feature extractor.

Specifically, we can extract the outputs of each stage of the model (4 stages) as well as the final average-pooled features, which are flattened.

The following snippet can be used for feature extraction:

from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModel.from_pretrained("nvidia/MambaVision-L3-256-21K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 256, 256)  # MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 1568])
print("Number of stages in extracted features:", len(features)) # 4 stages
print("Size of extracted features in stage 1:", features[0].size()) # torch.Size([1, 196, 128, 128])
print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 1568, 16, 16])
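Rather than printing individual stages by hand, the shapes of all extracted stages can be summarized in a loop. The sketch below assumes each feature map exposes a `.shape` in (batch, channels, height, width) order, matching the sizes printed above. `summarize_stages` is a hypothetical helper, not part of MambaVision.

```python
def summarize_stages(features):
    """Describe each stage's feature map, assuming (batch, channels, height, width)."""
    lines = []
    for stage, feat in enumerate(features, start=1):
        batch, channels, height, width = feat.shape
        lines.append(
            f"stage {stage}: {channels} channels, {height}x{width} spatial, batch {batch}"
        )
    return lines


# With the `features` list returned by the model above:
# print("\n".join(summarize_stages(features)))
```

Downstream heads that consume multi-scale features typically need exactly this information, the channel count and spatial size of each stage.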

📄 License

NVIDIA Source Code License-NC

📋 Information Table

Property	Details
Model Type	Image Classification
Training Data	ILSVRC/imagenet-21k
Library Name	transformers
Pipeline Tag	image-classification
License	other
License Name	nvclv1
License Link	LICENSE
