
Video Mask2Former Swin Tiny YouTubeVIS 2021 Instance

Developed by shivalikasingh
A video instance segmentation model with a Swin Transformer (Tiny) backbone, trained on the YouTubeVIS-2021 dataset
Downloads 22
Release Time: 3/15/2023

Model Overview

Video Mask2Former is an extension of Mask2Former to video instance segmentation. It keeps the same unified architecture and segments a video by predicting a set of masks together with their corresponding class labels.
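The snippet below is a minimal inference sketch of that idea: load the checkpoint, run a single frame, and turn the predicted mask/label pairs into an instance map. The hub id and the use of the image Mask2Former classes from Hugging Face transformers are assumptions; the exact loading path for the video variant may differ.

```python
# Minimal inference sketch. Assumes the checkpoint is loadable with the image
# Mask2Former classes and that the hub id below is correct (both assumptions).
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

checkpoint = "shivalikasingh/video-mask2former-swin-tiny-youtubevis-2021-instance"  # assumed hub id
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

frame = Image.open("frame_000.jpg")  # a single video frame (illustrative path)
inputs = processor(images=frame, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert the predicted mask/label pairs into an instance segmentation map.
result = processor.post_process_instance_segmentation(
    outputs, target_sizes=[frame.size[::-1]]
)[0]
print(result["segmentation"].shape)  # (height, width) map of instance ids
print(result["segments_info"])       # per-instance class label and score
```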

Model Features

Unified Segmentation Architecture
Handles instance, semantic, and panoptic segmentation with a single paradigm, treating all three tasks as if they were instance segmentation
Improved Attention Mechanism
Replaces the pixel decoder with a multi-scale deformable attention Transformer and employs a Transformer decoder with masked attention to enhance performance
Efficient Training Method
Computes losses on sampled points rather than entire masks, which significantly improves training efficiency (see the sketch after this list)
Video Processing Capability
Directly applies to video instance segmentation tasks without modifying the architecture and achieves state-of-the-art performance
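As an illustration of the point-sampling idea mentioned above, here is a toy PyTorch sketch that computes a mask loss on K randomly sampled points instead of every pixel. It is not Mask2Former's actual implementation (which mixes importance-based and uniform sampling); the shapes, point count, and function name are illustrative assumptions.

```python
# Toy sketch of point-sampled mask loss: sample K points per mask and compute
# BCE there instead of over all H*W pixels. Uniform random sampling only.
import torch
import torch.nn.functional as F

def point_sample_loss(pred_masks, gt_masks, num_points=112 * 112):
    """pred_masks: (N, H, W) mask logits; gt_masks: (N, H, W) binary targets."""
    n, h, w = pred_masks.shape
    # Random point coordinates in [-1, 1], as expected by grid_sample.
    coords = torch.rand(n, num_points, 1, 2) * 2 - 1
    pred_pts = F.grid_sample(
        pred_masks.unsqueeze(1), coords, align_corners=False
    ).squeeze(1).squeeze(-1)                      # (N, num_points)
    gt_pts = F.grid_sample(
        gt_masks.unsqueeze(1).float(), coords, align_corners=False
    ).squeeze(1).squeeze(-1)
    # Loss is evaluated on K points per mask rather than every pixel.
    return F.binary_cross_entropy_with_logits(pred_pts, gt_pts)

# Example: 3 predicted instance masks at 256x256 resolution.
loss = point_sample_loss(torch.randn(3, 256, 256), torch.randint(0, 2, (3, 256, 256)))
```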

Model Capabilities

Video Instance Segmentation
Object Mask Prediction
Multi-frame Video Analysis

Use Cases

Video Analysis
Video Object Tracking and Segmentation
Performs instance segmentation and tracking of objects in videos
Generates frame-by-frame object segmentation masks (see the sketch after this section)
Autonomous Driving Scene Understanding
Analyzes road scene videos to identify and segment various traffic participants
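Below is a hedged sketch of the frame-by-frame workflow described above, reusing the inference pattern from the overview. Reading frames with OpenCV and looping per frame is a simplification of the clip-level pipeline (instance ids are only guaranteed to stay consistent across frames when the model processes a clip jointly), and the file name and hub id are illustrative assumptions.

```python
# Frame-by-frame segmentation of a video file. Loading via the image
# Mask2Former classes and the hub id are assumptions, as in the earlier sketch.
import cv2
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

checkpoint = "shivalikasingh/video-mask2former-swin-tiny-youtubevis-2021-instance"  # assumed hub id
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

cap = cv2.VideoCapture("road_scene.mp4")  # illustrative input video
per_frame_results = []

while True:
    ok, bgr = cap.read()
    if not ok:
        break
    frame = Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One instance map plus per-instance labels/scores for this frame.
    result = processor.post_process_instance_segmentation(
        outputs, target_sizes=[frame.size[::-1]]
    )[0]
    per_frame_results.append(result)

cap.release()
print(f"Segmented {len(per_frame_results)} frames")
```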