Open-source Model of video-mask2former-swin-small - Free for Video Instance Segmentation Tasks

Video Mask2former Swin Small Youtubevis 2021 Instance

Developed by shivalikasingh

Video Mask2Former model trained on the YouTubeVIS-2021 dataset for video instance segmentation tasks, using Swin Transformer as the backbone network.

Image Segmentation

Transformers

Open Source License: MIT
Tags: Video Instance Segmentation, Multi-frame Mask Prediction, Swin Backbone Network

Downloads 18

Release Time: 3/22/2023

Model Overview

This model extends Mask2Former to video instance segmentation. Mask2Former handles instance, semantic, and panoptic segmentation through a unified paradigm: it predicts a set of masks and their corresponding labels.
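As a rough illustration of what "a set of masks and their corresponding labels" can look like in practice, the sketch below turns per-query class scores and per-query mask logits into a list of instances. It uses plain Python lists instead of tensors. The function name `select_instances`, the score threshold, and the convention that the last class index means "no object" are assumptions made for illustration; this is not the library's post-processing code.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_instances(class_logits, mask_logits, score_threshold=0.5):
    """Hypothetical helper.

    class_logits: [num_queries][num_classes + 1], last index assumed to be "no object".
    mask_logits:  [num_queries][height][width].
    """
    instances = []
    for q_logits, q_mask in zip(class_logits, mask_logits):
        probs = softmax(q_logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        no_object = len(probs) - 1
        if label == no_object or probs[label] < score_threshold:
            continue  # this query does not describe a confident object
        # sigmoid(v) > 0.5 is equivalent to v > 0, so threshold the logits at zero.
        binary = [[1 if v > 0 else 0 for v in row] for row in q_mask]
        instances.append({"label": label, "score": probs[label], "mask": binary})
    return instances


# Two queries, two real classes plus "no object", on a 2x2 image.
class_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
mask_logits = [[[2.0, -1.0], [3.0, -2.0]], [[-1.0, 1.0], [-1.0, 1.0]]]
instances = select_instances(class_logits, mask_logits)
```

Here the first query is kept as an instance of class 0, while the second query is dropped because its most likely class is the assumed "no object" index.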

Model Features

Unified Segmentation Paradigm

Treats instance segmentation, semantic segmentation, and panoptic segmentation uniformly as instance segmentation problems.

Efficient Attention Mechanism

Uses a multi-scale deformable attention Transformer as the pixel decoder, replacing the pixel decoder used in MaskFormer.

Masked Attention Decoder

Employs a Transformer decoder with masked attention to improve performance without increasing computational load.

Efficient Training Strategy

Significantly enhances training efficiency by computing loss on subsampled points rather than entire masks.
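The idea of computing the loss on subsampled points can be sketched in a few lines. The example below is a simplified illustration, not the model's training code: it draws integer pixel locations uniformly at random (with replacement) and averages a binary cross-entropy over those points only. The function names, the uniform sampling scheme, and the use of nested Python lists are all assumptions made for the sketch.

```python
import math
import random


def bce_with_logits(logit, target):
    # Numerically stable binary cross-entropy for a single logit/target pair.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def point_sampled_mask_loss(mask_logits, target_mask, num_points, rng):
    """Average BCE over num_points randomly sampled pixels (hypothetical helper)."""
    height, width = len(mask_logits), len(mask_logits[0])
    total = 0.0
    for _ in range(num_points):
        y, x = rng.randrange(height), rng.randrange(width)
        total += bce_with_logits(mask_logits[y][x], target_mask[y][x])
    return total / num_points


# A 4x4 target mask, a confident correct prediction, and a confident wrong one.
target = [[1, 1, 0, 0] for _ in range(4)]
good_logits = [[10.0 if t else -10.0 for t in row] for row in target]
bad_logits = [[-v for v in row] for row in good_logits]

good_loss = point_sampled_mask_loss(good_logits, target, num_points=8, rng=random.Random(0))
bad_loss = point_sampled_mask_loss(bad_logits, target, num_points=8, rng=random.Random(0))
```

The cost of the loss now depends on `num_points` rather than on the mask resolution, which is where the training savings come from.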

Model Capabilities

Video Instance Segmentation

Multi-object Tracking

Dynamic Scene Analysis

Use Cases

Video Analysis

Autonomous Driving Scene Understanding

Identifies and tracks dynamic objects on the road.

Accurately segments moving vehicles and pedestrians.

Video Surveillance

Real-time analysis of multi-object movements in surveillance videos.

Supports simultaneous tracking and segmentation of multiple objects.

🚀 Video Mask2Former

A Video Mask2Former model trained on YouTubeVIS-2021 instance segmentation (small-sized version, Swin backbone). It provides a solution for video instance segmentation tasks, extending the capabilities of the original Mask2Former.

🚀 Quick Start

The Video Mask2Former model is trained on YouTubeVIS-2021 instance segmentation. It was introduced in the paper Mask2Former for Video Instance Segmentation and first released in this repository. It is an extension of the original Mask2Former paper, Masked-attention Mask Transformer for Universal Image Segmentation.

Disclaimer: The team releasing Mask2Former did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

Unified Segmentation Paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation with the same approach. It predicts a set of masks and corresponding labels, treating all three tasks as instance segmentation.
Performance and Efficiency: It outperforms the previous SOTA, MaskFormer, in both performance and efficiency. It achieves this by:
- Replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer.
- Adopting a Transformer decoder with masked attention to boost performance without additional computation.
- Improving training efficiency by calculating the loss on subsampled points instead of whole masks.
Video Instance Segmentation: The paper Mask2Former for Video Instance Segmentation shows that Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, loss, or training pipeline.
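The masked attention mentioned in the list above can be pictured with a small, dependency-free sketch. A single query attends over a handful of pixel features, but attention scores at positions that an earlier mask prediction marked as background are set to negative infinity before the softmax. The function name, the single-query setup, and the fallback to unrestricted attention when the mask is empty are assumptions chosen for illustration, not the model's actual implementation.

```python
import math


def masked_attention(query, keys, values, foreground):
    """Single-query attention restricted to positions where foreground[i] is True."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Only restrict attention if at least one position is foreground;
    # otherwise fall back to ordinary attention to avoid an empty softmax.
    if any(foreground):
        scores = [s if keep else float("-inf") for s, keep in zip(scores, foreground)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
foreground = [True, True, False]  # third position was predicted as background

output, weights = masked_attention(query, keys, values, foreground)
```

The masking only changes which scores enter the softmax, so it adds no extra attention computation; the third position simply receives zero weight.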

📦 Installation

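The original model card gives no installation steps. The usage example below imports torch, torchvision, huggingface_hub, and transformers, so those packages must be available in your environment.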

💻 Usage Examples

Basic Usage

import torch
import torchvision
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation


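# Load the image processor and the Video Mask2Former checkpoint trained on YouTubeVIS-2021
image_processor = AutoImageProcessor.from_pretrained("facebook/video-mask2former-swin-small-youtubevis-2021-instance")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/video-mask2former-swin-small-youtubevis-2021-instance")

# Download a demo clip and decode it; read_video returns (frames, audio, info)
file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
video = torchvision.io.read_video(file_path)[0]  # uint8 tensor of shape (num_frames, height, width, channels)
# Preprocess each frame separately, then concatenate the frames along the first dimension
video_frames = [image_processor(images=frame, return_tensors="pt").pixel_values for frame in video]
video_input = torch.cat(video_frames)

with torch.no_grad():
    # video_input is a tensor, not a dict, so pass it as pixel_values instead of unpacking it
    outputs = model(pixel_values=video_input)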

# model predicts class_queries_logits of shape `(batch_size, num_queries, num_classes)`
# and masks_queries_logits of shape `(num_queries, batch_size, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

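# Pass the outputs to the processor for post-processing.
# post_process_video_instance_segmentation is the method named in the original card;
# confirm that your installed transformers version provides it.
result = image_processor.post_process_video_instance_segmentation(outputs, target_sizes=[tuple(video.shape[1:3])])[0]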
# we refer to the demo notebooks for visualization (see "Resources" section in the Mask2Former docs)
predicted_video_instance_map = result["segmentation"]
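Once a video instance map is available, a common next step is to summarize it per frame. The sketch below is only an illustration under stated assumptions: it assumes the map can be converted to nested lists indexed as `[frame][row][column]` holding integer instance IDs (for a tensor, `.tolist()` would produce this), and that background pixels carry a dedicated ID, taken here to be `-1`. Neither the layout nor the background ID is confirmed by the original card, and `instance_areas_per_frame` is a hypothetical helper.

```python
from collections import Counter


def instance_areas_per_frame(segmentation, background_id=-1):
    """Count pixels per instance ID in every frame (hypothetical helper)."""
    areas = []
    for frame in segmentation:
        counts = Counter(
            pixel for row in frame for pixel in row if pixel != background_id
        )
        areas.append(dict(counts))
    return areas


# Toy map with two frames of 2x2 pixels and two instances (IDs 0 and 1).
toy_map = [
    [[-1, 0], [1, 1]],
    [[0, 0], [-1, 1]],
]
areas = instance_areas_per_frame(toy_map)
```

Because the same ID refers to the same object in every frame, comparing the per-frame counts gives a simple view of how each object's visible area changes over the clip.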

Advanced Usage

For more code examples, refer to the documentation.

📚 Documentation

Model description

Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all three tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in terms of both performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.

In the paper Mask2Former for Video Instance Segmentation, the authors show that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or even the training pipeline.
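One way to picture why no architectural change is needed: if each query predicts a mask over the whole clip (frames × height × width) instead of a single image, the same query identifies the same object in every frame, so a track can be read directly from the mask. The sketch below illustrates this with a toy binary volume in plain Python; `track_centroids` and the data layout are hypothetical and are not part of the model or library.

```python
def track_centroids(volume_mask):
    """Return one (row, column) centroid per frame, or None when the object is absent.

    volume_mask: binary mask for a single query, indexed [frame][row][column].
    """
    track = []
    for frame in volume_mask:
        points = [
            (y, x)
            for y, row in enumerate(frame)
            for x, value in enumerate(row)
            if value
        ]
        if not points:
            track.append(None)  # object not visible in this frame
            continue
        cy = sum(p[0] for p in points) / len(points)
        cx = sum(p[1] for p in points) / len(points)
        track.append((cy, cx))
    return track


# One query's mask over a 3-frame clip of 2x3 pixels: the object moves right, then disappears.
clip_mask = [
    [[1, 0, 0], [1, 0, 0]],
    [[0, 1, 0], [0, 1, 0]],
    [[0, 0, 0], [0, 0, 0]],
]
track = track_centroids(clip_mask)
```

No separate association step between frames is involved here; the identity is carried by the query that produced the volume.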

Intended uses & limitations

You can use this particular checkpoint for instance segmentation. See the model hub to look for other fine-tuned versions of this model that may interest you.

📄 License

This model is released under the MIT license.

Property	Details
Model Type	Video Mask2Former (small-sized version, Swin backbone)
Training Data	YouTubeVIS-2021
Tags	vision, image-segmentation
