Open-source video instance segmentation model video-mask2former-swin-tiny - Free deployment for accurate video target segmentation

Video Mask2former Swin Tiny Youtubevis 2019 Instance

Developed by shivalikasingh

A tiny video instance segmentation model trained on the YouTubeVIS-2019 dataset, utilizing the Swin Transformer backbone and Mask2Former unified segmentation architecture

Image Segmentation

Transformers

Open Source License: MIT #Video Instance Segmentation #Swin Backbone Network #Multi-scale Attention

Downloads 19

Release Time : 3/15/2023

Model Overview

This model applies Mask2Former to video instance segmentation. It segments objects in videos by predicting a set of masks and their corresponding labels, and it does so without modifying the Mask2Former architecture.

Model Features

Unified Segmentation Architecture

Unifies instance segmentation, semantic segmentation, and panoptic segmentation as a mask prediction problem, using the same architecture for processing

Multi-scale Deformable Attention

Replaces the pixel decoder used in MaskFormer with a more advanced multi-scale deformable attention Transformer

Masked Attention Decoder

Uses a Transformer decoder with masked attention, which improves performance without additional computation

Efficient Training Strategy

Improves training efficiency by calculating the loss on sampled points rather than on full masks
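The point-sampling idea can be illustrated with a minimal sketch. It assumes uniform random sampling of pixel locations and a binary cross-entropy loss; the function names (`point_sampled_loss`, `full_mask_loss`) and the sample count are hypothetical and are not taken from the Mask2Former code, whose exact sampling scheme is not reproduced here.

```python
import math
import random


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit/target pair."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def full_mask_loss(pred_logits, target_mask):
    """Mean BCE over every pixel of an H x W mask (nested lists)."""
    losses = [
        bce_with_logits(p, t)
        for pred_row, target_row in zip(pred_logits, target_mask)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(losses) / len(losses)


def point_sampled_loss(pred_logits, target_mask, num_points, rng):
    """Mean BCE over `num_points` uniformly sampled pixel locations.

    The same coordinates index both the prediction and the ground truth,
    so the cost depends on `num_points` rather than on H * W.
    """
    height, width = len(pred_logits), len(pred_logits[0])
    total = 0.0
    for _ in range(num_points):
        y, x = rng.randrange(height), rng.randrange(width)
        total += bce_with_logits(pred_logits[y][x], target_mask[y][x])
    return total / num_points


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 64 x 64 prediction and a target derived from it.
    pred = [[rng.uniform(-3.0, 3.0) for _ in range(64)] for _ in range(64)]
    target = [[1.0 if value > 0.0 else 0.0 for value in row] for row in pred]
    print("full mask loss:    ", full_mask_loss(pred, target))
    print("256-point estimate:", point_sampled_loss(pred, target, 256, rng))
```

The sampled loss is an estimate of the full-mask loss, so it trades a small amount of variance for a cost that no longer grows with mask resolution.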

Model Capabilities

Video Object Instance Segmentation

Multi-object Tracking and Segmentation

Video Scene Understanding

Use Cases

Video Analysis

Autonomous Driving Scene Understanding

Identifies and segments dynamic objects such as vehicles and pedestrians in road scenes

Enables continuous tracking and precise segmentation of multiple objects in videos

Video Editing and Effects

Automatically separates foreground objects in videos for special effects processing

Provides precise object masks to support advanced video editing

Surveillance and Security

Intelligent Surveillance Analysis

Detects and tracks suspicious objects in surveillance videos in real time

Supports simultaneous tracking and behavior analysis of multiple targets

🚀 Video Mask2Former

The Video Mask2Former model is trained on YouTubeVIS-2019 instance segmentation (tiny-sized version, Swin backbone). It offers a unified approach to video instance segmentation.

🚀 Quick Start

The Video Mask2Former model was introduced in the paper Mask2Former for Video Instance Segmentation and first released in this repository. It is an extension of the original Mask2Former paper, Masked-attention Mask Transformer for Universal Image Segmentation.

Disclaimer: The team releasing Mask2Former did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

Unified Segmentation Paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation with the same approach, treating all 3 tasks as instance segmentation by predicting a set of masks and corresponding labels.
Performance and Efficiency: It outperforms the previous SOTA, MaskFormer, in both performance and efficiency. It achieves this by:
- Replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer.
- Adopting a Transformer decoder with masked attention to boost performance without additional computation.
- Improving training efficiency by calculating the loss on subsampled points instead of whole masks.
Video Instance Segmentation: The paper Mask2Former for Video Instance Segmentation shows that Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, loss, or training pipeline.
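The masked attention mentioned in the list above can be sketched in a few lines. This is a minimal illustration, assuming masked attention means that a query's cross-attention is restricted to positions inside a predicted foreground mask; the function name `masked_attention`, the single-query setup, and the empty-mask fallback are choices made for this sketch, not the library's API.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def masked_attention(query, keys, values, foreground):
    """Attend from one query to only the positions flagged in `foreground`.

    query: vector of length d
    keys, values: lists of N vectors (one per spatial position)
    foreground: list of N booleans, e.g. a thresholded mask prediction
    Returns (attended output vector, attention weights).
    """
    scale = 1.0 / math.sqrt(len(query))
    # If the mask is empty, attend everywhere so the softmax stays
    # well defined (an assumption of this sketch).
    if not any(foreground):
        foreground = [True] * len(keys)
    # Positions outside the mask get a score of -inf, so their softmax
    # weight is exactly zero.
    scores = [
        scale * dot(query, key) if keep else -math.inf
        for key, keep in zip(keys, foreground)
    ]
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [
        dot(weights, [value[i] for value in values])
        for i in range(len(values[0]))
    ]
    return output, weights


if __name__ == "__main__":
    keys = [[1.0, 0.0], [5.0, 0.0], [0.0, 1.0]]
    values = [[1.0], [2.0], [3.0]]
    # The second position is outside the mask and receives zero weight.
    output, weights = masked_attention([1.0, 0.0], keys, values, [True, False, True])
    print(output, weights)
```

Because masking only changes which scores enter the softmax, it adds no extra matrix products to the attention step, which is consistent with the claim that it adds no computation.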

📚 Documentation

Model description

Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in terms of both performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.

In the paper Mask2Former for Video Instance Segmentation, the authors have shown that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.
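To make "a set of masks and corresponding labels" concrete, here is a minimal sketch of turning per-query predictions into video instances. It assumes the last class column is a "no object" class, that each query carries one mask per frame, and that a mask logit above 0 marks foreground; the function name `queries_to_instances`, the data layout, and the score threshold are hypothetical and do not reproduce the library's post-processing.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def queries_to_instances(class_logits, mask_logits, score_threshold=0.5):
    """Convert per-query predictions into a list of instances.

    class_logits: num_queries x (num_classes + 1) nested lists; the last
        column is treated as "no object" (an assumption of this sketch).
    mask_logits: num_queries x num_frames x H x W nested lists.
    """
    instances = []
    for query_classes, query_masks in zip(class_logits, mask_logits):
        object_probs = softmax(query_classes)[:-1]
        label = max(range(len(object_probs)), key=object_probs.__getitem__)
        score = object_probs[label]
        if score < score_threshold:
            continue  # query is most likely "no object" or too uncertain
        # A logit above 0 corresponds to a sigmoid probability above 0.5.
        masks = [
            [[value > 0.0 for value in row] for row in frame]
            for frame in query_masks
        ]
        instances.append({"label": label, "score": score, "masks": masks})
    return instances


if __name__ == "__main__":
    # Two queries, two object classes plus "no object", two 1 x 2 frames.
    class_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
    mask_logits = [
        [[[2.0, -1.0]], [[-3.0, 0.5]]],
        [[[0.1, 0.2]], [[0.3, 0.4]]],
    ]
    for instance in queries_to_instances(class_logits, mask_logits):
        print(instance["label"], round(instance["score"], 3), instance["masks"])
```

In this layout one query yields one label and one mask per frame, so the same query index identifies the same object throughout the clip.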


Intended uses & limitations

You can use this particular checkpoint for instance segmentation. See the [model hub](https://huggingface.co/models?search=video-mask2former) to look for other fine-tuned versions of this model that may interest you.

💻 Usage Examples

Basic Usage

import torch
import torchvision
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation


# load Video Mask2Former trained on YouTubeVIS-2019 instance segmentation
image_processor = AutoImageProcessor.from_pretrained("facebook/video-mask2former-swin-tiny-youtubevis-2019-instance")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/video-mask2former-swin-tiny-youtubevis-2019-instance")

file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
video = torchvision.io.read_video(file_path)[0]
video_frames = [image_processor(images=frame, return_tensors="pt").pixel_values for frame in video]
video_input = torch.cat(video_frames)

with torch.no_grad():
    # video_input is a plain tensor of stacked frames, so pass it by keyword
    outputs = model(pixel_values=video_input)

# model predicts class_queries_logits of shape `(batch_size, num_queries, num_classes)`
# and masks_queries_logits of shape `(num_queries, batch_size, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# you can pass them to processor for postprocessing
result = image_processor.post_process_video_instance_segmentation(outputs, target_sizes=[tuple(video.shape[1:3])])[0]
# we refer to the demo notebooks for visualization (see "Resources" section in the Mask2Former docs)
predicted_video_instance_map = result["segmentation"]

For more code examples, refer to the documentation.

📄 License

This project is licensed under the MIT license.

Property	Details
Model Type	Video Mask2Former (tiny-sized version, Swin backbone) trained on YouTubeVIS-2019 instance segmentation
Training Data	YouTubeVIS-2019
Tags	vision, image-segmentation
