🚀 MambaVision: A Hybrid Mamba-Transformer Vision Backbone
This project presents the first hybrid model for computer vision that combines the advantages of Mamba and Transformers, addressing the need for more efficient visual feature modeling and the capture of long-range spatial dependencies in computer vision tasks.
🚀 Quick Start
To quickly get started with MambaVision, install the required package:

```bash
pip install mambavision
```
✨ Features
Model Overview
We have developed the first hybrid model for computer vision that leverages the strengths of Mamba and Transformers. Specifically, our core contribution is a redesign of the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
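For intuition, here is a minimal PyTorch sketch of that hybrid stage layout: Mamba-style mixer blocks followed by self-attention blocks at the end of the stage. The `MixerBlock` below is a hypothetical MLP stand-in (the real model uses a redesigned Mamba/SSM block), and none of these class names come from the MambaVision codebase.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Pre-norm multi-head self-attention with a residual connection."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, tokens, dim)
        h = self.norm(x)
        return x + self.attn(h, h, h)[0]

class MixerBlock(nn.Module):
    """Stand-in for a MambaVision mixer block; the real model uses a
    redesigned Mamba (SSM) formulation, which is omitted here."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class HybridStage(nn.Module):
    """A stage whose final blocks are self-attention, mirroring the finding
    that attention at the end of a stage helps long-range dependencies."""
    def __init__(self, dim, depth=8, num_attn=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MixerBlock(dim) for _ in range(depth - num_attn)]
            + [AttentionBlock(dim) for _ in range(num_attn)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

# Toy usage: 196 tokens of width 256 through one hybrid stage
stage = HybridStage(dim=256)
print(stage(torch.randn(2, 196, 256)).shape)   # torch.Size([2, 196, 256])
```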
Model Performance
MambaVision-L2-512-21K is pretrained on the ImageNet-21K dataset and fine-tuned on ImageNet-1K at 512 x 512 resolution.
| Property | Details |
|---|---|
| Model Type | Image Classification |
| Training Data | ILSVRC/imagenet-21K |
The following table shows the performance metrics:
| Name | Acc@1(%) | Acc@5(%) | #Params(M) | FLOPs(G) | Resolution |
|---|---|---|---|---|---|
| MambaVision-L2-512-21K | 87.3 | 98.4 | 241.5 | 196.3 | 512x512 |
In addition, the MambaVision models achieve a new SOTA Pareto front for Top-1 accuracy versus throughput.
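Throughput in such comparisons is typically measured as images per second at a fixed batch size. The sketch below shows one rough way to measure it; the batch size, resolution, and loop counts are illustrative assumptions, not the paper's exact benchmarking protocol.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, resolution=512, batch_size=32, iters=50, warmup=10):
    """Rough images/sec estimate on GPU; illustrative protocol only."""
    model.cuda().eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device="cuda")
    for _ in range(warmup):          # warm up CUDA kernels and allocator
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all queued work to finish
    return batch_size * iters / (time.time() - start)

# Usage: measure_throughput(model) with a model loaded as in the examples below
```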

📦 Installation
To install MambaVision, run the following command:

```bash
pip install mambavision
```
💻 Usage Examples
Basic Usage
Image Classification
In the following example, we demonstrate how MambaVision can be used for image classification, given an image from the COCO validation set as input:

The following Python code can be used for image classification:
```python
from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

# Load the model with its custom code from the Hugging Face Hub
model = AutoModelForImageClassification.from_pretrained(
    "nvidia/MambaVision-L2-512-21K", trust_remote_code=True
)
model.cuda().eval()

# Fetch a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Build the evaluation transform from the model's preprocessing config
input_resolution = (3, 512, 512)
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

# Run inference and map the top logit to its class label
inputs = transform(image).unsqueeze(0).cuda()
outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
The predicted label is `brown bear, bruin, Ursus arctos`.
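As a small extension of the example above (not part of the original snippet), the top-5 classes and their probabilities can be read off the same logits with a softmax:

```python
import torch

# Top-5 predictions with probabilities, reusing `logits` and `model` from above
probs = torch.softmax(logits, dim=-1)
top5 = probs.topk(5, dim=-1)
for p, idx in zip(top5.values[0].tolist(), top5.indices[0].tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```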
Feature Extraction
MambaVision can also be used as a generic feature extractor. Specifically, we can extract the outputs of each of the model's four stages, as well as the final average-pooled and flattened features.
The following Python code can be used for feature extraction:
```python
from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

# Load the backbone (no classification head) with its custom code
model = AutoModel.from_pretrained(
    "nvidia/MambaVision-L2-512-21K", trust_remote_code=True
)
model.cuda().eval()

# Fetch a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Build the evaluation transform from the model's preprocessing config
input_resolution = (3, 512, 512)
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

# Extract the average-pooled features and the per-stage feature maps
inputs = transform(image).unsqueeze(0).cuda()
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())
print("Number of stages in extracted features:", len(features))
print("Size of extracted features in stage 1:", features[0].size())
print("Size of extracted features in stage 4:", features[3].size())
```
📄 License
This project is licensed under the NVIDIA Source Code License - NC.