Video Mask2Former
The Video Mask2Former model is trained on YouTubeVIS-2021 instance segmentation (large-sized version, Swin backbone). It offers a unified approach to instance, semantic, and panoptic segmentation.
Quick Start
The Video Mask2Former model, trained on YouTubeVIS-2021 instance segmentation, is an extension of the original Mask2Former. It was introduced in the paper Mask2Former for Video Instance Segmentation and first released in this repository.
Disclaimer: The team releasing Mask2Former did not write a model card for this model, so this model card has been written by the Hugging Face team.
Features
- Unified Segmentation Paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation using the same approach by predicting a set of masks and corresponding labels (a toy illustration follows this list).
- Performance and Efficiency: It outperforms the previous SOTA, MaskFormer, in both performance and efficiency through multiple improvements.
- Video Instance Segmentation: Achieves state-of-the-art performance on video instance segmentation without modifying the architecture, loss, or training pipeline.
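To make the shared "set of masks plus labels" formulation concrete, here is a toy sketch (not from the model card; all tensors are made up) showing how the same per-query masks and class labels can be read as an instance map or collapsed into a semantic map:

import torch

# toy outputs: 3 predicted queries ("mask proposals") over a 4x4 image
num_queries, height, width = 3, 4, 4
mask_logits = torch.randn(num_queries, height, width)  # one mask per query
class_labels = torch.tensor([1, 1, 2])                  # one class label per query

# instance view: each pixel belongs to the query with the highest mask logit
instance_map = mask_logits.argmax(dim=0)

# semantic view: each pixel inherits the class label of its winning query
semantic_map = class_labels[instance_map]

Panoptic segmentation combines the two views (a semantic label plus an instance id per pixel), which is why a single set of masks and labels can serve all three tasks.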
Documentation
Model description
Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency, by:
(i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer;
(ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation;
(iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks (a simplified sketch of this idea follows below).
In the paper Mask2Former for Video Instance Segmentation, the authors have shown that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or even the training pipeline.
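The point-based loss in (iii) can be sketched as follows. This is a simplified illustration, not the paper's implementation: it samples points uniformly at random (the paper uses importance sampling), and the function name and default point count are made up.

import torch
import torch.nn.functional as F

def point_sampled_mask_loss(pred_mask_logits, gt_mask, num_points=4096):
    # pred_mask_logits and gt_mask: (height, width) tensors for one predicted/ground-truth mask pair
    height, width = gt_mask.shape
    # evaluate the loss on a random subset of points instead of every pixel
    idx = torch.randint(0, height * width, (num_points,))
    pred_points = pred_mask_logits.flatten()[idx]
    gt_points = gt_mask.flatten()[idx].float()
    return F.binary_cross_entropy_with_logits(pred_points, gt_points)

Because only a few thousand points per mask enter the loss, the memory and compute cost of training stays roughly independent of the mask resolution.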

Intended uses & limitations
You can use this particular checkpoint for instance segmentation. See the [model hub](https://huggingface.co/models?search=video-mask2former) to look for other fine-tuned versions of this model that may interest you.
Usage Examples
Basic Usage
import torch
import torchvision
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# load the image processor and the Video Mask2Former checkpoint
image_processor = AutoImageProcessor.from_pretrained("facebook/video-mask2former-swin-large-youtubevis-2021-instance")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/video-mask2former-swin-large-youtubevis-2021-instance")

# download a short demo clip and read it as a (num_frames, height, width, channels) tensor
file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
video = torchvision.io.read_video(file_path)[0]

# preprocess every frame and stack the frames into a single batch
video_frames = [image_processor(images=frame, return_tensors="pt").pixel_values for frame in video]
video_input = torch.cat(video_frames)

with torch.no_grad():
    outputs = model(pixel_values=video_input)

# the model predicts per-query class logits and mask logits
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# post-process into a per-pixel instance map for the whole clip
result = image_processor.post_process_video_instance_segmentation(outputs, target_sizes=[tuple(video.shape[1:3])])[0]
predicted_video_instance_map = result["segmentation"]
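The result dictionary can be inspected further. The snippet below is a sketch under the assumption that, as with Mask2Former's image-level post_process_instance_segmentation, the result also carries a segments_info list with per-instance id, label_id, and score entries; it maps label ids to class names via the model config:

# assumption: `result` also contains a "segments_info" list with per-instance
# "id", "label_id", and "score" entries, mirroring post_process_instance_segmentation
for segment in result.get("segments_info", []):
    label_name = model.config.id2label[segment["label_id"]]
    print(f'instance {segment["id"]}: {label_name} (score {segment["score"]:.2f})')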
For more code examples, we refer to the documentation.
License
This model is released under the MIT license.
| Property | Details |
| --- | --- |
| Model Type | Video Mask2Former (large-sized version, Swin backbone) |
| Training Data | YouTubeVIS-2021 |