Mask2Former Swin-Small Cityscapes Semantic
A Mask2Former model with a Swin-Small backbone, fine-tuned for semantic segmentation on the Cityscapes dataset
Downloads: 952
Release Time: 1/5/2023
Model Overview
Mask2Former is a universal image segmentation model that uses a single unified paradigm to handle instance segmentation, semantic segmentation, and panoptic segmentation. It performs segmentation by predicting a set of binary masks together with a class label for each mask.
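The mask-classification paradigm described above can be sketched in a few lines: each predicted mask carries a class distribution, and a semantic map is obtained by weighting each mask's class scores by its per-pixel probability. This is a simplified illustration (the real model also handles a "no object" class), with toy tensors standing in for network outputs:

```python
import numpy as np

def masks_to_semantic_map(mask_logits, class_logits):
    """Combine N predicted masks and their class scores into a per-pixel
    semantic map (simplified sketch of mask classification).

    mask_logits:  (N, H, W) per-mask logits
    class_logits: (N, C)    per-mask class logits over C classes
    """
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))         # sigmoid per mask
    class_probs = np.exp(class_logits)
    class_probs /= class_probs.sum(axis=-1, keepdims=True)  # softmax per mask
    # Weight each mask's class distribution by its per-pixel probability,
    # then sum over the N mask predictions.
    scores = np.einsum("nc,nhw->chw", class_probs, mask_probs)
    return scores.argmax(axis=0)  # (H, W) class id per pixel

# Toy example: 2 masks, 3 classes, 2x2 image
mask_logits = np.array([[[5., 5.], [-5., -5.]],    # mask 0 covers the top row
                        [[-5., -5.], [5., 5.]]])   # mask 1 covers the bottom row
class_logits = np.array([[4., 0., 0.],             # mask 0 -> class 0
                         [0., 0., 4.]])            # mask 1 -> class 2
print(masks_to_semantic_map(mask_logits, class_logits))
```

The top row of the toy image is assigned class 0 and the bottom row class 2, matching which mask dominates each pixel.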
Model Features
Unified Segmentation Paradigm
Treats instance segmentation, semantic segmentation, and panoptic segmentation as a single mask-classification task
Efficient Attention Mechanism
Uses a multi-scale deformable attention Transformer as the pixel decoder in place of conventional decoder designs
Masked Attention Decoder
Introduces masked attention in the Transformer decoder, restricting each query's cross-attention to its predicted foreground region, which improves accuracy without increasing computational cost
Efficient Training Method
Computes the mask loss on a small set of sampled points rather than on entire masks, significantly reducing memory use and improving training efficiency
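The point-sampling idea behind the last feature can be illustrated with a minimal sketch: instead of evaluating the mask loss over every pixel, the loss is computed only at a handful of sampled locations. This uses uniform random sampling for simplicity (the actual method uses importance sampling), and all names here are illustrative:

```python
import numpy as np

def point_sampled_bce(pred_logits, target, num_points, rng):
    """Binary cross-entropy evaluated at num_points sampled pixel
    locations instead of over the full mask. Simplified sketch:
    uniform sampling stands in for the importance sampling used
    in actual training.
    """
    h, w = target.shape
    ys = rng.integers(0, h, size=num_points)
    xs = rng.integers(0, w, size=num_points)
    logits = pred_logits[ys, xs]
    labels = target[ys, xs].astype(float)
    # Numerically stable BCE-with-logits at the sampled points only
    loss = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

rng = np.random.default_rng(0)
target = np.zeros((64, 64))
target[:, 32:] = 1                             # right half is foreground
good_pred = np.where(target == 1, 8.0, -8.0)   # confident, correct logits
loss = point_sampled_bce(good_pred, target, num_points=128, rng=rng)
```

Sampling 128 points instead of all 4096 pixels cuts the per-mask loss computation by more than an order of magnitude while still giving a near-zero loss for a correct prediction.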
Model Capabilities
Image Semantic Segmentation
Multi-category Object Recognition
High-precision Mask Prediction
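The capabilities above can be exercised through the Hugging Face Transformers API; the sketch below assumes the checkpoint is published under the standard `facebook/` namespace and substitutes a blank placeholder image for a real street photo:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Assumed checkpoint name for this model card
ckpt = "facebook/mask2former-swin-small-cityscapes-semantic"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt)
model.eval()

image = Image.new("RGB", (512, 256))  # placeholder; use a real street scene
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-pixel Cityscapes class ids at the original resolution
seg = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(seg.shape)
```

`post_process_semantic_segmentation` resolves the predicted masks and class scores into a single (H, W) label map, so `seg` here has shape (256, 512) with one Cityscapes class id per pixel.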
Use Cases
Autonomous Driving
Street Scene Semantic Segmentation
Performs pixel-level classification of urban road scenes to identify elements such as roads, vehicles, and pedestrians
Excellent performance on the Cityscapes dataset
Remote Sensing Image Analysis
Land Cover Classification
Performs segmentation of land cover types on satellite or aerial images