mask2former-swin-large-cityscapes-semantic Open-source Model - Handles multi-class image segmentation, specifically for urban landscape semantic segmentation

Mask2former Swin Large Cityscapes Semantic

Developed by facebook

A large-scale Mask2Former model based on the Swin backbone network, specifically trained for Cityscapes semantic segmentation tasks, adopting a unified architecture for various image segmentation tasks.

Image Segmentation

Transformers

Open Source License: Other
#Panoptic Segmentation #Swin Backbone Network #Multi-scale Attention

Downloads 296.33k

Release Time: 1/5/2023

Model Overview

Mask2Former is an advanced image segmentation model capable of handling instance segmentation, semantic segmentation, and panoptic segmentation tasks in a unified manner. This specific version is optimized for urban street scene semantic segmentation.

Model Features

Unified Segmentation Architecture

Handles instance segmentation, semantic segmentation, and panoptic segmentation tasks uniformly by predicting a set of masks and their corresponding labels.

Improved Attention Mechanism

Uses a multi-scale deformable attention Transformer as the pixel decoder and masked attention in the Transformer decoder, improving performance without additional computation.
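
To make the masked attention idea concrete, here is a minimal NumPy sketch of one cross-attention step in which each query attends only to locations that its previous mask prediction marks as foreground. The function name, the tensor shapes, the 1/sqrt(D) scaling, and the fallback for empty masks are assumptions of this sketch, not taken from the released code.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, prev_mask_logits):
    """Cross-attention restricted to each query's predicted foreground.

    queries: (Q, D); keys, values: (N, D); prev_mask_logits: (Q, N).
    Returns the attended features (Q, D) and the attention weights (Q, N).
    """
    # Foreground = mask probability above 0.5, i.e. a positive mask logit.
    foreground = prev_mask_logits > 0
    # Assumption: a query with an empty mask attends everywhere,
    # which avoids a softmax over a row of -inf.
    foreground[~foreground.any(axis=1)] = True

    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores = np.where(foreground, scores, -np.inf)
    # Numerically stable softmax over the N locations.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

Masking only changes which scores survive the softmax, so the cost of the attention step stays the same as in unmasked cross-attention.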

Efficient Training Strategy

Improves training efficiency by computing the mask loss on a subsampled set of points rather than on entire masks.
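
The point-based loss can be illustrated with a small NumPy sketch that evaluates binary cross-entropy on a random subset of pixels instead of the full mask. This is a simplification under stated assumptions: it samples pixel indices uniformly, and the function name and arguments are hypothetical rather than the training code's API.

```python
import numpy as np

def point_sampled_bce(mask_logits, target_mask, num_points, rng):
    """Binary cross-entropy on `num_points` randomly chosen pixels of one mask."""
    flat_logits = mask_logits.reshape(-1)
    flat_target = target_mask.reshape(-1).astype(np.float64)
    # Uniform sampling without replacement (an assumption of this sketch).
    idx = rng.choice(flat_logits.size, size=num_points, replace=False)
    x, y = flat_logits[idx], flat_target[idx]
    # Stable BCE-with-logits: max(x, 0) - x*y + log(1 + exp(-|x|)).
    loss = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return loss.mean()
```

With `num_points` much smaller than the number of pixels, the loss touches only a fraction of each mask, which is where the memory and compute savings come from.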

Model Capabilities

Image Semantic Segmentation

Street Scene Image Analysis

Multi-category Object Recognition

Use Cases

Intelligent Transportation Systems

Urban Street Scene Parsing

Automatically identifies and segments urban street scene elements such as roads, vehicles, and pedestrians.

Can be used for traffic flow analysis, autonomous driving environment perception, and other applications.

Geographic Information Systems

Satellite Image Analysis

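Performs semantic segmentation on satellite or aerial images after fine-tuning on such data; this checkpoint itself is trained on street-level Cityscapes scenes.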

Can be used for urban planning, land use classification, and similar scenarios.

🚀 Mask2Former

This Mask2Former model is trained on Cityscapes semantic segmentation (large-sized version, Swin backbone). The architecture offers a unified approach to instance, semantic, and panoptic segmentation.

🚀 Quick Start

The Mask2Former model trained on Cityscapes semantic segmentation (large-sized version, Swin backbone) was introduced in the paper "Masked-attention Mask Transformer for Universal Image Segmentation" and first released in this repository.

Disclaimer: The team releasing Mask2Former did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

Unified Segmentation Paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation using the same approach, treating all three tasks as instance segmentation by predicting a set of masks and corresponding labels.
Performance and Efficiency: It outperforms the previous SOTA, MaskFormer, in both performance and efficiency through several key improvements:
- Replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer.
- Adopting a Transformer decoder with masked attention to boost performance without additional computation.
- Improving training efficiency by calculating the loss on subsampled points instead of whole masks.


📚 Documentation

Model description

Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all three tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in terms of both performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.
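
For semantic segmentation, the predicted set of masks and labels has to be merged into a single per-pixel class map. The NumPy sketch below shows one common way to do this, assuming that the last class column is a "no object" class and that masks are weighted by their class probabilities and summed over queries. The function name and shapes are illustrative, not the library's API.

```python
import numpy as np

def semantic_map_from_queries(class_logits, mask_logits):
    """Merge per-query predictions into an (H, W) map of class indices.

    class_logits: (Q, C + 1), last column assumed to be "no object".
    mask_logits:  (Q, H, W).
    """
    # Softmax over classes, then drop the trailing "no object" column.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    class_probs = np.exp(shifted)
    class_probs /= class_probs.sum(axis=1, keepdims=True)
    class_probs = class_probs[:, :-1]
    # Per-pixel mask probabilities via a sigmoid.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))
    # Weight each query's mask by its class probabilities, sum over queries.
    per_class = np.einsum("qc,qhw->chw", class_probs, mask_probs)
    return per_class.argmax(axis=0)
```

Because every pixel simply takes the highest-scoring class, the result is a class map with no notion of separate object instances.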

Intended uses & limitations

You can use this particular checkpoint for semantic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you.

How to use

Here is how to use this model:

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation


# load Mask2Former fine-tuned on Cityscapes semantic segmentation
processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-large-cityscapes-semantic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-large-cityscapes-semantic")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# you can pass them to processor for postprocessing
predicted_semantic_map = processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
# we refer to the demo notebooks for visualization (see "Resources" section in the Mask2Former docs)
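
As a follow-up to the snippet above, the semantic map can be summarized as the share of pixels per class, which is a typical first step for street scene analysis. This is a minimal sketch: `class_pixel_shares` is a hypothetical helper, and the toy label mapping is made up for illustration.

```python
import numpy as np

def class_pixel_shares(semantic_map, id2label):
    """Map each class name to the fraction of pixels assigned to it."""
    ids, counts = np.unique(np.asarray(semantic_map), return_counts=True)
    total = int(counts.sum())
    return {
        id2label.get(int(i), f"class_{int(i)}"): int(c) / total
        for i, c in zip(ids, counts)
    }

# Toy 2x2 map with a hypothetical two-entry label mapping.
toy_map = np.array([[0, 0], [0, 13]])
print(class_pixel_shares(toy_map, {0: "road", 13: "car"}))

# Assumed usage with the snippet above:
# shares = class_pixel_shares(predicted_semantic_map.cpu().numpy(), model.config.id2label)
```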

For more code examples, see the documentation.

📄 License

The license for this model is listed as "Other".

Property	Details
Model Type	Mask2Former model trained on Cityscapes semantic segmentation (large-sized version, Swin backbone)
Training Data	Cityscapes, COCO
Tags	vision, image-segmentation
Widget Examples	Cats, Castle
