MambaVision-S-1K: Open-Source Hybrid Mamba-Transformer Vision Model for Efficient Visual Feature and Long-Range Spatial Modeling

MambaVision-S-1K

Developed by NVIDIA

The first hybrid computer vision model to combine the strengths of Mamba and Transformers. It redesigns the Mamba formulation to model visual features more efficiently, and adds self-attention blocks in the final layers of the Mamba architecture to better capture long-range spatial dependencies.

Image Classification

Transformers

Open Source License: Other

#Hybrid Mamba-Transformer Architecture #Hierarchical Feature Extraction #Long-range Spatial Modeling

Downloads 908

Release Time : 7/14/2024

Model Overview

MambaVision is a visual backbone network that combines the strengths of Mamba and Transformer, primarily used for image classification and feature extraction tasks, featuring efficient visual feature modeling and long-range spatial dependency processing capabilities.

Model Features

Hybrid Architecture

Combines the strengths of Mamba and Transformers, with a redesigned Mamba formulation that models visual features more efficiently.

Long-range Spatial Dependency Modeling

Adds several self-attention blocks in the final layers of the Mamba architecture, which greatly improves its capacity to capture long-range spatial dependencies.

Hierarchical Architecture

Provides a family of MambaVision models with a hierarchical architecture to meet various design criteria.

High Performance

Achieves a new SOTA Pareto frontier in Top-1 accuracy and throughput.

Model Capabilities

Image Classification

Feature Extraction

Multi-stage Feature Output

Use Cases

Computer Vision

Image Classification

Use MambaVision for image classification, such as identifying animal species.

Predicted class: Brown bear (brown bear, bruin, Ursus arctos)

Feature Extraction

Use MambaVision as a generic feature extractor.

Returns the feature maps from all four hierarchical stages together with the final average-pooled features.

🚀 MambaVision: A Hybrid Mamba-Transformer Vision Backbone

We've developed a hybrid Mamba-Transformer vision backbone for efficient image feature extraction.

🚀 Quick Start

It is highly recommended to install the requirements for MambaVision by running the following:

pip install mambavision
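
Before running the usage examples, it can help to confirm that the packages they import are actually available. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the assumption that the pip package `mambavision` is importable under the same name is ours rather than something stated here.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by the usage examples in this document (assumed, not verified).
required = ["mambavision", "transformers", "timm", "PIL", "requests"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, installing it with pip before loading the model avoids a failure partway through the example.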

✨ Features

The first hybrid computer vision model to leverage the strengths of both Mamba and Transformers.
A redesigned Mamba formulation that models visual features more efficiently.
A comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
Several self-attention blocks in the final layers of the Mamba architecture, which greatly improve its capacity to capture long-range spatial dependencies.
A family of MambaVision models with a hierarchical architecture to meet various design criteria.

💻 Usage Examples

Basic Usage

Image Classification

The following example demonstrates how MambaVision can be used for image classification, taking an image from the COCO validation set as input:

from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-S-1K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 224, 224)  # (channels, height, width); MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()
# model inference
outputs = model(inputs)
logits = outputs['logits'] 
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
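
The snippet above reports only the single most likely class. As a hedged sketch of how the same logits could be turned into a ranked top-k list with probabilities, the helper below applies a softmax in plain Python. The function name `top_k_predictions` and the toy logits and labels are hypothetical and serve only as illustration; they are not real model output.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs from raw logits."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy values for illustration only.
toy_logits = [0.5, 2.0, -1.0, 1.0]
toy_labels = {0: "cat", 1: "bear", 2: "car", 3: "dog"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=3):
    print(f"{label}: {prob:.3f}")
```

With the classification model loaded as above, one would presumably pass `logits[0].detach().cpu().tolist()` and `model.config.id2label` to the helper.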

Feature Extraction

MambaVision can also be used as a generic feature extractor. The following snippet can be used for feature extraction:

from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModel.from_pretrained("nvidia/MambaVision-S-1K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)
input_resolution = (3, 224, 224)  # (channels, height, width); MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()
# model inference
out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 640])
print("Number of stages in extracted features:", len(features)) # 4 stages
print("Size of extracted features in stage 1:", features[0].size()) # torch.Size([1, 80, 56, 56])
print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 640, 7, 7])
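
The printed shapes for stages 1 and 4 suggest a regular pattern: channels double and spatial resolution halves from one stage to the next, starting at 80 channels and a stride of 4. The sketch below estimates the per-stage shapes for other input resolutions under that assumption. The function name `expected_stage_shapes`, the shapes of stages 2 and 3, and the use of floor division for sizes not divisible by 32 are our assumptions, not values read from the model.

```python
def expected_stage_shapes(height, width, base_dim=80, num_stages=4, stem_stride=4):
    """Estimate (channels, height, width) of each stage's feature map.

    Assumes channels double and spatial size halves at every stage, which
    matches the stage 1 and stage 4 shapes printed for a 224x224 input.
    """
    shapes = []
    for stage in range(num_stages):
        stride = stem_stride * 2 ** stage   # 4, 8, 16, 32 by default
        channels = base_dim * 2 ** stage    # 80, 160, 320, 640 by default
        shapes.append((channels, height // stride, width // stride))
    return shapes


for index, shape in enumerate(expected_stage_shapes(224, 224), start=1):
    print(f"stage {index}: {shape}")
```

Comparing these estimates against `features[i].size()` is a quick sanity check when switching to a different input resolution.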

📚 Documentation

Model Overview

We have developed the first hybrid model for computer vision which leverages the strengths of Mamba and Transformers. Specifically, our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.
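
To make the idea of placing self-attention blocks at the final layers concrete, the following sketch lays out the block types of a single stage as a list of labels. It is purely illustrative: the function name `hybrid_stage_layout`, the example depth, and the number of attention blocks are hypothetical, and this is not the actual MambaVision implementation.

```python
def hybrid_stage_layout(depth, num_attention):
    """Return block types for one stage: Mamba-style mixers first, attention last."""
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    # Early layers use the Mamba-style mixer; the final layers use self-attention.
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


# Hypothetical stage with 8 layers, the last 4 of which are self-attention.
print(hybrid_stage_layout(depth=8, num_attention=4))
```

Keeping attention at the end means the layers with a global receptive field operate on features that the earlier mixer layers have already processed.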

Model Performance

MambaVision demonstrates strong performance, achieving a new state-of-the-art (SOTA) Pareto front in terms of Top-1 accuracy and throughput.
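
A model lies on the Pareto front when no other model is at least as good on both accuracy and throughput and strictly better on one of them. The sketch below shows how such a front could be computed. The function name `pareto_front` and the sample points are hypothetical placeholders, not measured results for any real model.

```python
def pareto_front(points):
    """Return names of points not dominated by any other point.

    `points` is a list of (name, throughput, accuracy); higher is better for both.
    """
    front = []
    for name, throughput, accuracy in points:
        dominated = any(
            other_thr >= throughput
            and other_acc >= accuracy
            and (other_thr > throughput or other_acc > accuracy)
            for _, other_thr, other_acc in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration only, not benchmark numbers.
toy_points = [("A", 1000, 80.0), ("B", 2000, 82.0), ("C", 3000, 81.0), ("D", 1500, 81.5)]
print(pareto_front(toy_points))
```

In this toy set, A and D are dominated by B, while B and C each win on one axis and therefore both remain on the front.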

📄 License

NVIDIA Source Code License-NC
