Open-source test_mask2former_swin_large_cityscapes_semantic model - Efficiently handle Cityscapes semantic segmentation tasks

Test Mask2former Swin Large Cityscapes Semantic

Developed by kroixy

Large-scale Mask2Former model based on Swin backbone network, specifically trained for Cityscapes semantic segmentation tasks, using a unified architecture for image segmentation tasks

Image Segmentation

Safetensors

Open Source License: Other #Unified Image Segmentation #Multi-scale Attention #Swin Backbone Network

Downloads 22

Release Time: 2/11/2025

Model Overview

Mask2Former is a universal image segmentation model that handles instance, semantic, and panoptic segmentation in the same way: it predicts a set of masks and their corresponding labels. It improves on its predecessor, MaskFormer, in both performance and efficiency.
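
To make the "set of masks and labels" idea concrete, here is a minimal, dependency-free sketch of how per-query outputs can be merged into a semantic map: each query's class probabilities (with the trailing "no object" slot dropped) are multiplied by its per-pixel mask probabilities, the products are summed over queries, and each pixel takes the highest-scoring class. The function name `semantic_map` and the toy inputs are illustrative assumptions; the real model does this on tensors and also resizes the masks to the target resolution.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def semantic_map(class_logits, mask_logits):
    """class_logits: [num_queries][num_classes + 1], last entry = "no object".
    mask_logits: [num_queries][H][W].
    Returns an H x W grid of class ids."""
    num_queries = len(class_logits)
    num_classes = len(class_logits[0]) - 1
    # Per-query class probabilities without the "no object" slot.
    class_probs = [softmax(row)[:num_classes] for row in class_logits]
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of p(class) * p(mask).
            scores = [
                sum(class_probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        result.append(row)
    return result

# Toy example: two queries, two real classes, a 1x2 image.
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # query 0 favors class 0, query 1 class 1
mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]        # query 0 covers the left pixel, query 1 the right
print(semantic_map(class_logits, mask_logits))      # [[0, 1]]
```

Because the class and mask predictions are combined only at the end, the same query outputs can in principle be post-processed into instance or panoptic results instead.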

Model Features

Unified Segmentation Architecture

Handles instance segmentation, semantic segmentation, and panoptic segmentation tasks uniformly through a paradigm of predicting masks and labels

Masked Attention Mechanism

Adopts a Transformer decoder with masked attention, which improves performance without adding computation

Efficient Training Strategy

Significantly enhances training efficiency by computing loss on subsampled points rather than entire masks

Multi-scale Feature Processing

Replaces the earlier pixel decoder with a multi-scale deformable attention Transformer to strengthen feature extraction
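
As a rough illustration of the masked attention feature listed above, the sketch below restricts each query's cross-attention to the pixels inside a foreground mask (for example, the previous decoder layer's mask prediction thresholded at 0.5). It is a plain-Python sketch under stated assumptions, not the model's implementation: the name `masked_attention`, the single-head layout, and the fallback to full attention when a mask is empty are choices made here for illustration.

```python
import math

def masked_attention(queries, keys, values, allowed):
    """Cross-attention where query i only attends to keys j with allowed[i][j] True.

    queries: [num_queries][dim]; keys, values: [num_pixels][dim];
    allowed: [num_queries][num_pixels] booleans (thresholded mask prediction).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for query, row in zip(queries, allowed):
        # Assumption: an empty mask falls back to attending everywhere,
        # so the softmax below never sees only -inf logits.
        if not any(row):
            row = [True] * len(keys)
        logits = [
            scale * sum(a * b for a, b in zip(query, key)) if ok else -math.inf
            for key, ok in zip(keys, row)
        ]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # masked entries become 0.0
        total = sum(weights)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
# The third pixel is outside the mask, so its value (5, 5) cannot leak into the output.
print(masked_attention(queries, keys, values, [[True, True, False]]))
```

The mask only removes entries from the softmax, which is consistent with the claim that it adds no extra computation compared with ordinary cross-attention.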

Model Capabilities

Image Semantic Segmentation

Multi-category Object Recognition

Pixel-level Annotation

Use Cases

Autonomous Driving

Street Scene Semantic Understanding

Performs pixel-level segmentation of various elements in urban road scenes (such as vehicles, pedestrians, roads, etc.)

Can be used in the environmental perception module of autonomous driving systems

Geographic Information Systems

Aerial Image Analysis

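With fine-tuning on aerial data, could classify and identify buildings, vegetation, water bodies, etc., in aerial or satellite images (this checkpoint itself is trained on street-level Cityscapes scenes)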

Assists in urban planning and land resource management

🚀 Mask2Former

The Mask2Former model is trained on Cityscapes semantic segmentation (large-sized version, Swin backbone). It offers a unified approach for instance, semantic, and panoptic segmentation.

🚀 Quick Start

The Mask2Former model trained on Cityscapes semantic segmentation (large-sized version, Swin backbone) was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository.

Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team.

✨ Features

Unified Segmentation Paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm by predicting a set of masks and corresponding labels, treating all 3 tasks as instance segmentation.
Performance and Efficiency: It outperforms the previous state of the art, MaskFormer, and is also more efficient. This is achieved by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.
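
Point (iii) above can be sketched in a few lines: instead of averaging the mask loss over every pixel, the loss is averaged over a fixed number of sampled locations. The version below is a simplified assumption for illustration only. It samples integer pixel positions uniformly with replacement and uses binary cross-entropy, and the helper name `sampled_bce` is hypothetical; the actual training code may sample points differently.

```python
import math
import random

def sampled_bce(mask_logits, target, num_points, rng):
    """Binary cross-entropy averaged over `num_points` random pixels.

    mask_logits, target: H x W nested lists (target entries are 0 or 1).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    height, width = len(target), len(target[0])
    total = 0.0
    for _ in range(num_points):
        y, x = rng.randrange(height), rng.randrange(width)
        z, t = mask_logits[y][x], target[y][x]
        # Numerically stable BCE with logits: max(z, 0) - z*t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / num_points

target = [[1, 1, 0, 0] for _ in range(4)]
good = [[6.0 if t else -6.0 for t in row] for row in target]   # confident and correct
bad = [[-6.0 if t else 6.0 for t in row] for row in target]    # confident and wrong
rng = random.Random(0)
print(sampled_bce(good, target, num_points=8, rng=rng))  # small loss
print(sampled_bce(bad, target, num_points=8, rng=rng))   # large loss
```

The cost of the loss then depends on `num_points` rather than on the mask resolution, which is where the training-efficiency gain comes from.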


📚 Documentation

Intended uses & limitations

You can use this particular checkpoint for semantic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you.

How to use

Here is how to use this model:

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# load Mask2Former fine-tuned on Cityscapes semantic segmentation
processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-large-cityscapes-semantic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-large-cityscapes-semantic")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# you can pass them to processor for postprocessing
predicted_semantic_map = processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
# we refer to the demo notebooks for visualization (see "Resources" section in the Mask2Former docs)

For more code examples, we refer to the documentation.
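
As a small follow-up to the snippet above, one simple way to inspect a predicted semantic map is to report what fraction of the image each class covers. The helper below is a minimal sketch in plain Python; the name `class_fractions` and the toy label table are illustrative assumptions, not part of the library.

```python
from collections import Counter

def class_fractions(label_map, id2label):
    """Return {class name: fraction of pixels}, largest class first.

    label_map: H x W nested list of integer class ids.
    id2label: mapping from class id to a readable name.
    """
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {
        id2label.get(class_id, f"class_{class_id}"): n / total
        for class_id, n in counts.most_common()
    }

# Toy 2x3 map with an illustrative label table.
label_map = [[0, 0, 13], [0, 1, 13]]
id2label = {0: "road", 1: "sidewalk", 13: "car"}
print(class_fractions(label_map, id2label))
```

With the real model, the same helper could be called as `class_fractions(predicted_semantic_map.tolist(), model.config.id2label)`.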

📄 License

License: other

Property	Details
Tags	vision, image-segmentation
Datasets	coco

