MambaVision-L-21K Open-Source Vision Model - Exceptional for Visual Tasks like Image Classification

MambaVision-L-21K

Developed by NVIDIA

MambaVision is a hybrid Mamba-Transformer backbone for vision applications. It combines a redesigned Mamba formulation with Vision Transformer self-attention and performs strongly in image classification and downstream vision tasks.

Image Classification

Transformers

Open Source License: Other

Tags: #Hybrid Mamba-Transformer #Efficient Visual Modeling #Long-range Spatial Dependencies

Downloads 571

Release Time: 3/24/2025

Model Overview

MambaVision is a novel hybrid Mamba-Transformer backbone that improves visual feature modeling by redesigning the Mamba formulation and adding several self-attention blocks at the final layers to capture long-range spatial dependencies. The model family achieves state-of-the-art (SOTA) Top-1 accuracy and image throughput on ImageNet-1K classification, and outperforms comparably sized backbones on object detection, instance segmentation, and semantic segmentation.

Model Features

Hybrid Architecture Design

Combines the strengths of Mamba and Vision Transformers, with a redesigned Mamba formulation for more efficient modeling of visual features.

Hierarchical Structure

Uses a hierarchical (multi-stage) architecture to meet various design criteria, with several self-attention blocks at the final layers to capture long-range spatial dependencies.

High Performance

Achieves 86.1% Top-1 accuracy on the ImageNet-1K classification task and excels in downstream vision tasks.

Efficient Inference

Sets a new SOTA Pareto front in Top-1 accuracy versus image throughput, balancing accuracy and efficiency.

Model Capabilities

Image Classification

Feature Extraction

Object Detection

Instance Segmentation

Semantic Segmentation

Use Cases

Computer Vision

Image Classification

Classifies input images to identify the main object categories.

Achieves 86.1% Top-1 accuracy on ImageNet-1K.

Feature Extraction

Extracts multi-level image features for downstream vision tasks.

Exposes the outputs of all 4 stages as well as the final average-pooled features.

Object Detection

Serves as a backbone network for object detection tasks.

Outperforms backbone networks of similar scale on the MS COCO dataset.

Semantic Segmentation

Serves as a backbone network for semantic segmentation tasks.

Outperforms backbone networks of similar scale on the ADE20K dataset.

🚀 MambaVision: A Hybrid Mamba-Transformer Vision Backbone

We propose a novel hybrid Mamba-Transformer backbone for vision applications, achieving SOTA performance in image classification and excelling in downstream tasks.

🚀 Quick Start

It is highly recommended to install the requirements for MambaVision by running the following:

pip install mambavision
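After installation, it can be useful to confirm that the package is visible to the current Python environment before loading a model. The sketch below uses only the standard library. The helper name `is_installed` is hypothetical, and the check assumes that the import name matches the pip package name `mambavision`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the import name matches the pip package name.
    print("mambavision installed:", is_installed("mambavision"))
```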

✨ Features

Propose a novel hybrid Mamba-Transformer backbone, denoted MambaVision, that is specifically tailored for vision applications.
Redesign the Mamba formulation to enhance its capability for efficient modeling of visual features.
Conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
Introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
Achieve new state-of-the-art (SOTA) performance in ImageNet-1K image classification and perform well in downstream tasks.

📚 Documentation

Model Description

We propose a novel hybrid Mamba-Transformer backbone, denoted MambaVision, which is specifically tailored for vision applications. Our core contribution is a redesign of the Mamba formulation that enhances its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve new state-of-the-art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision.

Model Performance

MambaVision-L-21K is pretrained on the ImageNet-21K dataset and fine-tuned on ImageNet-1K.

Name	Acc@1(%)	Acc@5(%)	#Params(M)	FLOPs(G)	Resolution
MambaVision-L-21K	86.1	97.9	227.9	34.9	224x224

In addition, the MambaVision models achieve a new SOTA Pareto front in terms of Top-1 accuracy and throughput.
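A Pareto front here means the set of models for which no other model is at least as fast and at least as accurate, and strictly better on one of the two. A minimal sketch of how such a front could be computed from (throughput, accuracy) pairs is shown below. The function name `pareto_front`, the model names, and all numbers are placeholders for illustration, not measured MambaVision results.

```python
from typing import List, Tuple

# (name, throughput in images/s, Top-1 accuracy in %)
Point = Tuple[str, float, float]


def pareto_front(models: List[Point]) -> List[Point]:
    """Keep the models that no other model dominates on throughput and accuracy."""
    front = []
    for name, thr, acc in models:
        dominated = any(
            (t >= thr and a >= acc) and (t > thr or a > acc)
            for _, t, a in models
        )
        if not dominated:
            front.append((name, thr, acc))
    # Sort by throughput so the front reads left to right on a plot.
    return sorted(front, key=lambda m: m[1])


# Placeholder numbers for illustration only; NOT measured results.
candidates = [
    ("model_a", 6000.0, 82.0),
    ("model_b", 4000.0, 83.0),
    ("model_c", 3500.0, 82.5),  # slower and less accurate than model_b
    ("model_d", 1500.0, 85.0),
]
print([name for name, _, _ in pareto_front(candidates)])
```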


💻 Usage Examples

Basic Usage

Image Classification

In the following example, we demonstrate how MambaVision can be used for image classification.

The input is an image from the validation set of the COCO dataset; its URL appears in the snippet below.

The following snippet can be used for image classification:

import requests
import torch
from PIL import Image
from timm.data.transforms_factory import create_transform
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-L-21K", trust_remote_code=True)

# move the model to the GPU and switch to eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # ensure 3 channels
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution

# build the evaluation transform from the preprocessing settings stored in the model config
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()  # add a batch dimension
# model inference (gradients are not needed)
with torch.no_grad():
    outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

The predicted label is brown bear, bruin, Ursus arctos.
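The snippet above reports only the single most likely class. To inspect the runner-up classes as well, the logits can be converted to probabilities with a softmax and then ranked. The following is a minimal, framework-free sketch; the helper `top_k_predictions` and the toy logits and labels are hypothetical. With the model above, one would pass `logits[0].tolist()` and `model.config.id2label` instead.

```python
import math
from typing import Dict, List, Tuple


def top_k_predictions(
    logits: List[float], id2label: Dict[int, str], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values for illustration only; not real model outputs.
toy_logits = [2.0, 0.5, 4.0, -1.0]
toy_labels = {0: "cat", 1: "dog", 2: "bear", 3: "car"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```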

Feature Extraction

MambaVision can also be used as a generic feature extractor.

Specifically, we can extract the outputs of each of the model's 4 stages, as well as the final average-pooled features, which are flattened.

The following snippet can be used for feature extraction:

import requests
import torch
from PIL import Image
from timm.data.transforms_factory import create_transform
from transformers import AutoModel

model = AutoModel.from_pretrained("nvidia/MambaVision-L-21K", trust_remote_code=True)

# move the model to the GPU and switch to eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # ensure 3 channels
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution

# build the evaluation transform from the preprocessing settings stored in the model config
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()  # add a batch dimension
# model inference (gradients are not needed)
with torch.no_grad():
    out_avg_pool, features = model(inputs)
# the shapes in the comments are examples; channel widths depend on the model variant
print("Size of the averaged pool features:", out_avg_pool.size())  # e.g. torch.Size([1, 640])
print("Number of stages in extracted features:", len(features))  # 4 stages
print("Size of extracted features in stage 1:", features[0].size())  # e.g. torch.Size([1, 80, 56, 56])
print("Size of extracted features in stage 4:", features[3].size())  # e.g. torch.Size([1, 640, 7, 7])
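The shapes quoted in the comments above (80 channels at 56x56 in stage 1, 640 channels at 7x7 in stage 4, for a 224x224 input) are consistent with a hierarchical layout in which stage 1 runs at stride 4 and each later stage halves the resolution and doubles the channel count. The sketch below reproduces that pattern under exactly this assumption. The helper `stage_shapes` is hypothetical, the intermediate stages are inferred rather than taken from the source, and `base_dim=80` is simply the stage-1 width from the comment; other variants may use a different base width.

```python
from typing import List, Tuple


def stage_shapes(
    height: int, width: int, base_dim: int = 80, num_stages: int = 4
) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) per stage of a hierarchical backbone.

    Assumes stage 1 runs at stride 4 and every later stage halves the
    spatial resolution while doubling the channel count.
    """
    shapes = []
    for i in range(num_stages):
        stride = 4 * (2 ** i)
        shapes.append((base_dim * (2 ** i), height // stride, width // stride))
    return shapes


for i, shape in enumerate(stage_shapes(224, 224), start=1):
    print(f"stage {i}: {shape}")
```

Floor division is a simplification; for input sizes that are not multiples of 32, the actual feature-map sizes depend on the padding used by the model.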

📄 License

NVIDIA Source Code License-NC

Property	Details
Model Type	Hybrid Mamba-Transformer Vision Backbone
Training Data	ILSVRC/imagenet-21k
Pipeline Tag	image-classification
Library Name	transformers
