Mask2Former: Open-Source Image Segmentation Model for Instance, Semantic, and Panoptic Segmentation

Mask2former Swin Tiny Cityscapes Semantic

Developed by facebook

Mask2Former is a unified image segmentation framework capable of handling instance segmentation, semantic segmentation, and panoptic segmentation tasks. This model is based on the Swin-Tiny backbone network and has been fine-tuned for semantic segmentation on the Cityscapes dataset.

Image Segmentation

Transformers

Open Source License: Other
Tags: Unified Image Segmentation, Multi-scale Attention, Mask Prediction

Downloads 55.98k

Release Time: 1/5/2023

Model Overview

Mask2Former unifies instance segmentation, semantic segmentation, and panoptic segmentation by predicting a set of masks and their corresponding labels, treating all three tasks as instance segmentation problems. Compared to its predecessor MaskFormer, Mask2Former shows significant improvements in both performance and efficiency.
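
To make the "set of masks plus labels" idea concrete, here is a minimal sketch of how per-query predictions can be merged into a semantic map. It uses NumPy and toy arrays rather than the real model; the function name `semantic_map_from_queries` is hypothetical, and the recipe (softmax over classes, sigmoid over masks, weighted sum over queries, argmax) is an assumption about how semantic post-processing typically works for this kind of model, not a copy of the library code.

```python
import numpy as np

def semantic_map_from_queries(class_logits, mask_logits):
    """Combine per-query class scores and masks into a per-pixel label map.

    class_logits: (num_queries, num_labels + 1); the last column is "no object".
    mask_logits:  (num_queries, height, width).
    """
    # Softmax over classes, then drop the trailing "no object" column.
    shifted = class_logits - class_logits.max(axis=-1, keepdims=True)
    class_probs = np.exp(shifted)
    class_probs /= class_probs.sum(axis=-1, keepdims=True)
    class_probs = class_probs[:, :-1]

    # Per-pixel sigmoid turns each query's mask logits into probabilities.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))

    # Weight every mask by its class probabilities and sum over queries.
    per_class_scores = np.einsum("qc,qhw->chw", class_probs, mask_probs)
    return per_class_scores.argmax(axis=0)

# Toy input: 2 queries, 2 real labels (+1 "no object"), 2x2 image.
class_logits = np.array([[5.0, 0.0, 0.0],    # query 0 votes for label 0
                         [0.0, 5.0, 0.0]])   # query 1 votes for label 1
mask_logits = np.array([[[4.0, 4.0], [-4.0, -4.0]],   # query 0 covers the top row
                        [[-4.0, -4.0], [4.0, 4.0]]])  # query 1 covers the bottom row
semantic_map = semantic_map_from_queries(class_logits, mask_logits)
```

In this toy case the top row is assigned label 0 and the bottom row label 1. Instance and panoptic outputs would start from the same per-query predictions but keep the queries separate instead of summing them per class.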

Model Features

Unified Segmentation Framework

Unifies instance segmentation, semantic segmentation, and panoptic segmentation into a single framework

Efficient Attention Mechanism

Uses a multi-scale deformable attention Transformer as the pixel decoder, replacing the pixel decoder used in MaskFormer

Masked Attention Mechanism

Introduces a Transformer decoder with masked attention, which improves performance without adding computation

Efficient Training Strategy

Significantly improves training efficiency by computing loss on sampled points rather than entire masks
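
The point-sampling idea can be sketched as follows. This is a simplified illustration in NumPy, assuming uniform random sampling and a plain binary cross-entropy; the function name `sampled_point_bce` and the point count are hypothetical, and the actual training recipe may choose points differently (for example by favouring uncertain pixels).

```python
import numpy as np

def sampled_point_bce(mask_logits, target_mask, num_points, rng):
    """Binary cross-entropy evaluated on randomly sampled pixels of one mask."""
    height, width = mask_logits.shape
    # Draw pixel indices uniformly at random (a simplification, see the note above).
    flat_idx = rng.choice(height * width, size=num_points, replace=False)
    logits = mask_logits.reshape(-1)[flat_idx]
    targets = target_mask.reshape(-1)[flat_idx].astype(np.float64)
    # Numerically stable BCE with logits: max(x, 0) - x * t + log(1 + exp(-|x|)).
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

rng = np.random.default_rng(0)
target = np.zeros((64, 64), dtype=bool)
target[16:48, 16:48] = True
good_logits = np.where(target, 6.0, -6.0)   # confident and correct prediction
bad_logits = -good_logits                   # confident and wrong prediction

# Only 256 of the 4096 pixels contribute to each loss value.
loss_good = sampled_point_bce(good_logits, target, num_points=256, rng=rng)
loss_bad = sampled_point_bce(bad_logits, target, num_points=256, rng=rng)
```

The loss touches only `num_points` pixels per mask, so memory and compute scale with the number of sampled points rather than with the mask resolution.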

Model Capabilities

Image Segmentation

Semantic Segmentation

Instance Segmentation

Panoptic Segmentation

Use Cases

Autonomous Driving

Street Scene Semantic Segmentation

Performs semantic segmentation on urban street scenes to identify elements such as roads, buildings, and pedestrians

Fine-tuned on the Cityscapes dataset, so urban street scenes are the domain this checkpoint targets

Medical Imaging

Medical Image Analysis

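The framework could be fine-tuned for organ or lesion segmentation in medical images; this checkpoint is fine-tuned on Cityscapes street scenes and would need further fine-tuning for that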

🚀 Mask2Former

The Mask2Former model is trained on Cityscapes semantic segmentation (tiny-sized version, Swin backbone). It offers a unified solution for image segmentation tasks.

🚀 Quick Start

This checkpoint is fine-tuned for semantic segmentation on Cityscapes. The Mask2Former architecture also covers instance and panoptic segmentation, but those tasks need other fine-tuned checkpoints. The example below loads the model and produces a semantic map for one image:

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation


# load Mask2Former fine-tuned on Cityscapes semantic segmentation
processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-tiny-cityscapes-semantic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-tiny-cityscapes-semantic")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
# (the extra label is the "no object" class)
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# you can pass them to processor for postprocessing
predicted_semantic_map = processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
# we refer to the demo notebooks for visualization (see "Resources" section in the Mask2Former docs)
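
As a quick sanity check before visualizing, the semantic map can be summarized as the fraction of pixels per class. The sketch below is self-contained and uses a small NumPy array as a stand-in for `predicted_semantic_map.cpu().numpy()`. The label list is the commonly used 19-class Cityscapes ordering and is an assumption here; with a loaded model, `model.config.id2label` would be the safer source. The helper name `class_coverage` is hypothetical.

```python
import numpy as np

# Assumed Cityscapes label order; prefer model.config.id2label when a model is loaded.
CITYSCAPES_LABELS = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]

def class_coverage(semantic_map, id2label):
    """Return {label name: fraction of pixels} for every class present in the map."""
    ids, counts = np.unique(semantic_map, return_counts=True)
    total = semantic_map.size
    return {id2label[int(i)]: int(c) / total for i, c in zip(ids, counts)}

id2label = dict(enumerate(CITYSCAPES_LABELS))

# Stand-in for predicted_semantic_map.cpu().numpy(): a 4x4 map of class ids.
toy_map = np.array([[10, 10, 10, 10],
                    [2, 2, 13, 13],
                    [0, 0, 0, 0],
                    [0, 0, 0, 0]])
coverage = class_coverage(toy_map, id2label)
```

For the toy map this reports half of the pixels as road, a quarter as sky, and an eighth each as building and car.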

✨ Features

Unified Paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm by predicting a set of masks and corresponding labels.
Performance and Efficiency: It outperforms the previous SOTA, MaskFormer, in both performance and efficiency. It achieves this by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.

📚 Documentation

Model description

Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm: predicting a set of masks and corresponding labels. Hence, all three tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in both performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.
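
The masked attention in point (ii) can be illustrated with a small NumPy sketch. It assumes a single attention head without learned projections, and it thresholds the previous layer's mask prediction at probability 0.5; the function name `masked_cross_attention` and the toy shapes are hypothetical, so treat it as an illustration of the idea rather than the model's implementation.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, prev_mask_logits):
    """Single-head cross-attention restricted to each query's predicted foreground.

    queries: (num_queries, dim); keys, values: (num_pixels, dim).
    prev_mask_logits: (num_queries, num_pixels) mask logits from the previous layer.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    # Foreground means sigmoid(logit) > 0.5, which is the same as logit > 0.
    foreground = prev_mask_logits > 0
    # A query with an empty mask attends everywhere so its softmax stays defined.
    foreground[~foreground.any(axis=-1)] = True
    scores = np.where(foreground, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))
keys = rng.normal(size=(6, 4))
values = rng.normal(size=(6, 4))
prev_mask_logits = np.array([[3.0, 3.0, 3.0, -3.0, -3.0, -3.0],     # query 0: pixels 0-2
                             [-3.0, -3.0, -3.0, -3.0, -3.0, -3.0]])  # query 1: empty mask
attended, weights = masked_cross_attention(queries, keys, values, prev_mask_logits)
```

Query 0 places all of its attention weight on pixels 0 to 2, while query 1, whose predicted mask is empty, falls back to attending over every pixel. The attention computation has the same shape as unmasked attention, which is consistent with the claim that masking adds no extra computation.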


Intended uses & limitations

You can use this particular checkpoint for semantic segmentation. See the model hub for other fine-tuned versions on a task that interests you.

For more code examples, see the Mask2Former documentation.

📄 License

The license for this model is listed as "other".

Property	Details
Tags	vision, image-segmentation
Datasets	coco
Widget Examples	Cats, Castle
