MambaVision-T-1K: Open-Source Computer Vision Model Combining Mamba and Transformer to Enhance Long-Range Spatial Modeling

MambaVision-T-1K

Developed by NVIDIA

MambaVision is the first hybrid computer vision model combining the advantages of Mamba and Transformer. It significantly enhances the modeling of long-range spatial dependencies through a redesigned Mamba formulation and self-attention blocks in the final layers.

Image Classification

Transformers

Open Source License: Other
Tags: Hybrid Mamba-Transformer, Efficient Visual Modeling, Long-range Spatial Dependencies

Release Time: 7/14/2024

Model Overview

MambaVision is a hybrid Mamba-Transformer vision backbone designed for image classification and feature extraction. It combines the efficient modeling capability of Mamba with the long-range dependency capture of Transformer self-attention, achieving a new SOTA Pareto front in terms of Top-1 accuracy and throughput.

Model Features

Hybrid Architecture Innovation

The first model to combine the advantages of Mamba and Transformer, with a redesigned Mamba formulation that enhances visual feature modeling

Hierarchical Design

Offers a series of models with hierarchical architectures to meet various design needs

Efficient Long-range Dependency Modeling

Incorporates several self-attention blocks at the final layers of the Mamba architecture, significantly improving the ability to capture long-range spatial dependencies

Model Capabilities

Image classification

Image feature extraction

Multi-stage feature output

Use Cases

Computer Vision

Image Classification

Classifies and identifies input images, such as recognizing animal species

Successfully identified a brown bear in the example

Feature Extraction

Extracts multi-level feature representations of images for downstream tasks

Outputs feature maps from all 4 stages plus the flattened average-pooled features

🚀 MambaVision: A Hybrid Mamba-Transformer Vision Backbone

We've developed a hybrid model for computer vision that combines Mamba and Transformers, offering high performance in image classification and feature extraction.

🚀 Quick Start

It is highly recommended to install the requirements for MambaVision by running the following:

pip install mambavision
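
The snippets on this page import several packages besides `mambavision` itself. As a minimal sketch, the following check lists which of them are not importable in the current environment. The module names are taken from the import lines used on this page; treating `mambavision` as the import name of the pip package is an assumption, and `missing_modules` is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Top-level import names used by the examples on this page.
# "mambavision" as the import name of the pip package is an assumption.
REQUIRED_MODULES = ("mambavision", "transformers", "timm", "PIL", "requests")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, installing it with pip before running the examples should be enough in most environments.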

✨ Features

The first hybrid model for computer vision that leverages the strengths of both Mamba and Transformers.
A redesigned Mamba formulation that enhances its capability for efficient modeling of visual features.
A comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
Several self-attention blocks at the final layers of the Mamba architecture, which greatly improve the capacity to capture long-range spatial dependencies.
A family of MambaVision models with a hierarchical architecture to meet various design criteria.
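
The placement of self-attention at the final layers can be pictured as an ordering of block types within a stage. The sketch below is only an illustration of that ordering, not the actual MambaVision implementation: `hybrid_stage_layout` is a hypothetical helper, and the depth and number of attention blocks are illustrative values that do not come from this page.

```python
def hybrid_stage_layout(depth, num_attention):
    """Return the mixer type of each block in one stage.

    Mamba-style blocks come first; self-attention blocks occupy the
    final positions, mirroring the design described in the list above.
    """
    if not 0 <= num_attention <= depth:
        raise ValueError("num_attention must be between 0 and depth")
    return ["mamba"] * (depth - num_attention) + ["attention"] * num_attention


# Illustrative values only: 8 blocks, the last 4 using self-attention.
layout = hybrid_stage_layout(depth=8, num_attention=4)
print(layout)
```

Keeping the attention blocks at the end means the early blocks process features with the Mamba mixer, and only the last blocks mix information globally through self-attention.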

💻 Usage Examples

Basic Usage

Image Classification

In the following example, we demonstrate how MambaVision can be used for image classification. The input is an image from the COCO val2017 set, loaded from the URL given in the snippet.

The following snippet can be used for image classification:

import requests
import torch
from PIL import Image
from timm.data.transforms_factory import create_transform
from transformers import AutoModelForImageClassification

# trust_remote_code=True lets transformers run the model code hosted in the checkpoint repository
model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)

# move to GPU and switch to eval mode for inference (a CUDA device is required as written)
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # ensure 3 channels
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()  # add batch dimension

# model inference; gradients are not needed
with torch.no_grad():
    outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

The predicted label is brown bear, bruin, Ursus arctos.
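
The snippet above reports only the single most likely class. To inspect several candidates with their probabilities, one option is to apply a softmax to the logits and rank the result. The following is a minimal sketch using plain Python lists rather than tensors; `top_k` is a hypothetical helper, and the toy logits and labels are made up for illustration and are not model output.

```python
import math


def top_k(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs from raw logits."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Made-up values for illustration only; not produced by MambaVision.
toy_logits = [0.5, 3.0, 1.0]
toy_labels = {0: "class_a", 1: "class_b", 2: "class_c"}
for label, prob in top_k(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model from the example, something like `top_k(logits[0].tolist(), model.config.id2label)` should work, assuming `id2label` is keyed by integer class index as it is used in the snippet above.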

Feature Extraction

MambaVision can also be used as a generic feature extractor. Specifically, we can extract the outputs of each of the model's 4 stages as well as the final average-pooled features, which are flattened.

The following snippet can be used for feature extraction:

import requests
import torch
from PIL import Image
from timm.data.transforms_factory import create_transform
from transformers import AutoModel

# trust_remote_code=True lets transformers run the model code hosted in the checkpoint repository
model = AutoModel.from_pretrained("nvidia/MambaVision-T-1K", trust_remote_code=True)

# move to GPU and switch to eval mode for inference (a CUDA device is required as written)
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # ensure 3 channels
input_resolution = (3, 224, 224)  # MambaVision supports any input resolution

transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)
inputs = transform(image).unsqueeze(0).cuda()  # add batch dimension

# model inference; returns the pooled features and a list of per-stage feature maps
with torch.no_grad():
    out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 640])
print("Number of stages in extracted features:", len(features))  # 4 stages
print("Size of extracted features in stage 1:", features[0].size())  # torch.Size([1, 80, 56, 56])
print("Size of extracted features in stage 4:", features[3].size())  # torch.Size([1, 640, 7, 7])
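
The printed sizes for stages 1 and 4 are consistent with a pyramid in which stage 1 downsamples the input by 4, and each later stage halves the spatial size and doubles the channel count. The sketch below works through that arithmetic. The intermediate stage sizes and the floor division for resolutions not divisible by 32 are assumptions inferred from the two sizes shown above, not values stated on this page, and `expected_stage_shapes` is a hypothetical helper.

```python
def expected_stage_shapes(height, width, base_channels=80, num_stages=4, first_stride=4):
    """Return (channels, height, width) per stage under the assumed pyramid.

    Assumption: stage 1 has stride 4 and 80 channels; every following stage
    doubles the stride and the channel count.
    """
    shapes = []
    for stage in range(num_stages):
        stride = first_stride * 2 ** stage
        channels = base_channels * 2 ** stage
        shapes.append((channels, height // stride, width // stride))
    return shapes


for index, shape in enumerate(expected_stage_shapes(224, 224), start=1):
    print(f"stage {index}: {shape}")
# stage 1 gives (80, 56, 56) and stage 4 gives (640, 7, 7), matching the sizes printed above
```

Comparing these expected shapes against `features[i].size()` for a new input resolution is a quick way to check whether the assumption holds before wiring the features into a downstream head.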

📚 Documentation

Model Overview

We have developed the first hybrid model for computer vision which leverages the strengths of Mamba and Transformers. Specifically, our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conducted a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria.

Model Performance

MambaVision demonstrates strong performance, achieving a new SOTA Pareto front in terms of Top-1 accuracy and throughput.
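
A model lies on the Pareto front when no other model is at least as good in both throughput and Top-1 accuracy and strictly better in one of them. The sketch below illustrates that definition only. `pareto_front` is a hypothetical helper, and the model names and numbers are placeholders invented for the example; they are not measurements of MambaVision or any other model.

```python
def pareto_front(models):
    """Return names of models not dominated in (throughput, accuracy).

    Each entry is (name, throughput, top1_accuracy); higher is better for both.
    """
    front = []
    for name, throughput, accuracy in models:
        dominated = any(
            t >= throughput and a >= accuracy and (t > throughput or a > accuracy)
            for _, t, a in models
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values, invented purely to illustrate the definition.
candidates = [
    ("model_a", 1000.0, 82.0),
    ("model_b", 2000.0, 80.0),
    ("model_c", 1500.0, 79.0),  # slower and less accurate than model_b
    ("model_d", 500.0, 81.0),   # slower and less accurate than model_a
]
print(pareto_front(candidates))
```

In this toy data, only `model_a` and `model_b` remain, since each of the other two is beaten on both axes by one of them.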

📄 License

NVIDIA Source Code License-NC

Additional Information

Property	Details
Datasets	ILSVRC/imagenet-1k
Model Type	Hybrid Mamba-Transformer Vision Backbone
Pipeline Tag	image-classification
Tags	image-feature-extraction
Library Name	transformers
License Name	nvclv1
License Link	LICENSE
