video-mask2former: an open-source video instance segmentation model for accurate video segmentation, free to deploy

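Video Mask2Former Swin Large YouTubeVIS 2021 Instance

Developed by shivalikasingh

A video instance segmentation model trained on the YouTubeVIS-2021 dataset, using a Swin Transformer backbone and the unified Mask2Former segmentation architecture

Image segmentation

Transformers

License: MIT #video-instance-segmentation #Swin-backbone #multi-scale-attention

Downloads: 52

Released: 3/22/2023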

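Model overview

The model performs video instance segmentation by predicting a set of masks and their corresponding labels. It uses a Transformer architecture to handle segmentation tasks in a unified way, and it improves on its predecessor in both performance and efficiency.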

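Model features

Unified segmentation architecture

Treats instance, semantic, and panoptic segmentation uniformly as instance segmentation problems

Efficient attention mechanism

Replaces the conventional pixel decoder with a multi-scale deformable attention Transformer

Masked-attention decoder

Introduces a Transformer decoder with masked attention, which improves performance without adding computation

Efficient training strategy

Computes the loss on sampled points rather than on whole masks, which makes training considerably more efficient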
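
The unified formulation listed above can be illustrated with a toy example. If every task is expressed as a set of (mask, label) predictions, a per-pixel semantic map can be derived by weighting each query's class probabilities with its mask probability at that pixel and taking the best class. The sketch below is pure Python for readability; the function name `semantic_map_from_queries` and the toy numbers are hypothetical, and the post-processing in any given library may differ.

```python
from typing import List


def semantic_map_from_queries(
    class_probs: List[List[float]],       # [num_queries][num_classes]
    mask_probs: List[List[List[float]]],  # [num_queries][height][width]
) -> List[List[int]]:
    """Combine per-query class and mask probabilities into a per-pixel label map."""
    num_queries = len(class_probs)
    num_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    label_map = []
    for y in range(height):
        row = []
        for x in range(width):
            # score[c] = sum over queries of P(class c | query) * P(pixel in mask | query)
            scores = [
                sum(class_probs[q][c] * mask_probs[q][y][x] for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        label_map.append(row)
    return label_map


# Toy input (hypothetical values): 2 queries, 2 classes, a 2x2 image.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [
    [[0.9, 0.1], [0.9, 0.1]],  # query 0 covers the left column
    [[0.1, 0.9], [0.1, 0.9]],  # query 1 covers the right column
]
label_map = semantic_map_from_queries(class_probs, mask_probs)
print(label_map)  # [[0, 1], [0, 1]]
```

Instance and panoptic outputs keep the queries separate instead of summing over them, which is why the same set of predictions can serve all three tasks.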

Model capabilities

Video instance segmentation

Multi-object tracking

Dynamic scene analysis

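Use cases

Video analysis

Autonomous driving scene understanding

Analyzes dynamic object instances in road scenes in real time

Can accurately segment and track moving vehicles, pedestrians, and other targets

Video surveillance

Multi-object detection and tracking in surveillance footage

Keeps instance identities consistent across frames over long durations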

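🚀 Video Mask2Former

Video Mask2Former is a model trained on the YouTubeVIS-2021 instance segmentation dataset (large-sized version, Swin backbone). It was introduced in the paper Mask2Former for Video Instance Segmentation and first released in this repository. Video Mask2Former is an extension of the original Mask2Former paper, Masked-attention Mask Transformer for Universal Image Segmentation.

Note that the team releasing Mask2Former did not write a model card for this model; this model card was written by the Hugging Face team.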

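✨ Main features

Unified paradigm: Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm, by predicting a set of masks and corresponding labels. All 3 tasks are treated as if they were instance segmentation.
Strong performance: Mask2Former outperforms the previous SOTA model, MaskFormer, in both performance and efficiency. It does so by replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, by adopting a Transformer decoder with masked attention that boosts performance without introducing additional computation, and by computing the loss on subsampled points instead of whole masks to improve training efficiency.
Strong video segmentation results: In the paper Mask2Former for Video Instance Segmentation, the authors show that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or even the training pipeline.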
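
To make the masked-attention idea more concrete, here is a minimal sketch for a single query. It assumes that masked attention means restricting the softmax in cross-attention to the positions that the previous layer's mask prediction marked as foreground, so the cost is the same as ordinary attention plus a mask. The function name `masked_attention_weights` and the fallback for an empty mask are illustrative choices, not code from the Mask2Former repository.

```python
import math
from typing import List


def masked_attention_weights(scores: List[float], prev_mask: List[bool]) -> List[float]:
    """Softmax over attention scores, restricted to positions where prev_mask is True."""
    # Assumption: if the previous mask is empty, fall back to attending everywhere,
    # otherwise the softmax would have no valid position.
    if not any(prev_mask):
        prev_mask = [True] * len(scores)
    # Masked-out positions get -inf, so they contribute exp(-inf) == 0.0 to the softmax.
    masked = [s if keep else -math.inf for s, keep in zip(scores, prev_mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: four image positions, the previous mask kept only the first two.
scores = [2.0, 1.0, 0.5, 3.0]
prev_mask = [True, True, False, False]
weights = masked_attention_weights(scores, prev_mask)
print(weights)  # the last two weights are exactly 0.0
```

Note that position 3 has the highest raw score but receives zero weight, because it lies outside the previously predicted mask.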

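📚 Detailed documentation

Model description

Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels, all 3 tasks are treated as if they were instance segmentation. It outperforms the previous SOTA model, MaskFormer, in both performance and efficiency, mainly through the following improvements:

Replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer.
Adopting a Transformer decoder with masked attention, which boosts performance without introducing additional computation.
Computing the loss on subsampled points instead of whole masks, which improves training efficiency.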
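
The last improvement in the list can be sketched in a few lines. The example below computes a binary cross-entropy mask loss on a small number of sampled pixels and compares it with the loss over the full mask. It is a simplified illustration under stated assumptions: points are drawn uniformly at integer pixel positions, whereas an actual implementation may sample at continuous coordinates or prefer uncertain regions, and the names `bce` and `sampled_mask_loss` are hypothetical.

```python
import math
import random
from typing import List


def bce(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sampled_mask_loss(
    pred_logits: List[List[float]],
    target_mask: List[List[float]],
    num_points: int,
    rng: random.Random,
) -> float:
    """Average BCE over num_points randomly chosen pixels instead of the full mask."""
    height, width = len(pred_logits), len(pred_logits[0])
    total = 0.0
    for _ in range(num_points):
        y, x = rng.randrange(height), rng.randrange(width)
        total += bce(pred_logits[y][x], target_mask[y][x])
    return total / num_points


rng = random.Random(0)
height, width = 64, 64
# Toy data: the left half is foreground and the prediction is confident and correct.
target = [[1.0 if x < width // 2 else 0.0 for x in range(width)] for _ in range(height)]
logits = [[4.0 if x < width // 2 else -4.0 for x in range(width)] for _ in range(height)]

full_loss = sum(
    bce(logits[y][x], target[y][x]) for y in range(height) for x in range(width)
) / (height * width)
point_loss = sampled_mask_loss(logits, target, num_points=256, rng=rng)
```

Here the sampled loss touches 256 of the 4096 pixels. On this toy input every pixel has the same loss, so the two values agree; on real predictions the sampled loss is an estimate of the full one, obtained with far less memory and compute.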

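In the paper Mask2Former for Video Instance Segmentation, the authors show that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or the training pipeline.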

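Model image

Intended uses and limitations

You can use this particular checkpoint for instance segmentation. See the model hub to look for other fine-tuned versions of this model that may interest you.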

💻 Usage examples

Basic usage

Here is how to use this model:

import torch
import torchvision
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Load Mask2Former trained on YouTubeVIS 2021 instance segmentation
checkpoint = "facebook/video-mask2former-swin-large-youtubevis-2021-instance"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

# Download a demo clip and read its frames as a (num_frames, height, width, channels) tensor
file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
video = torchvision.io.read_video(file_path)[0]

# Preprocess each frame, then concatenate the frames along the batch dimension
video_frames = [image_processor(images=frame, return_tensors="pt").pixel_values for frame in video]
video_input = torch.cat(video_frames)

with torch.no_grad():
    # video_input is a tensor, so pass it by keyword instead of unpacking it with **
    outputs = model(pixel_values=video_input)

# The model predicts class_queries_logits of shape `(batch_size, num_queries, num_classes)`
# and masks_queries_logits of shape `(num_queries, batch_size, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# Pass the outputs to the processor for postprocessing.
# Note: post_process_video_instance_segmentation is the call given in the original model card;
# check that your installed transformers version provides it.
result = image_processor.post_process_video_instance_segmentation(outputs, target_sizes=[tuple(video.shape[1:3])])[0]
# See the demo notebooks for visualization ("Resources" section in the Mask2Former docs)
predicted_video_instance_map = result["segmentation"]

For more code examples, refer to the documentation.
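
As a follow-up to the usage example, the sketch below shows one way to inspect a predicted video instance map once it is available. It assumes, purely for illustration, that the map can be converted to a nested list of integer instance ids indexed as `[frame][row][column]` (for a tensor, `.tolist()` would give this structure) and that `-1` marks background. Neither the layout nor the background id is confirmed by the model card, and the helper `instance_areas_per_frame` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def instance_areas_per_frame(
    video_map: List[List[List[int]]], background_id: int = -1
) -> Dict[int, List[int]]:
    """Count, for every instance id, how many pixels it covers in each frame."""
    num_frames = len(video_map)
    areas: Dict[int, List[int]] = defaultdict(lambda: [0] * num_frames)
    for t, frame in enumerate(video_map):
        for row in frame:
            for instance_id in row:
                if instance_id != background_id:
                    areas[instance_id][t] += 1
    return dict(areas)


# Toy map (assumed layout): 3 frames of 2x3 pixels, -1 is background.
toy_map = [
    [[0, 0, -1], [1, -1, -1]],     # frame 0
    [[-1, 0, 0], [1, 1, -1]],      # frame 1
    [[-1, -1, 0], [-1, -1, -1]],   # frame 2: instance 1 has left the scene
]
areas = instance_areas_per_frame(toy_map)
print(areas)  # {0: [2, 2, 1], 1: [1, 2, 0]}
```

Because an instance keeps the same id in every frame, a per-frame area of zero indicates that the object is absent or fully occluded in that frame.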

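📄 License

This project is released under the MIT license.

Property	Details
Model type	Video Mask2Former model trained on the YouTubeVIS-2021 instance segmentation dataset (large-sized version, Swin backbone)
Training data	YouTubeVIS-2021
Tags	vision, image segmentation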