VideoMAE (base-sized model, pre-trained only)
VideoMAE model pre-trained in a self-supervised way on Kinetics-400 for 1600 epochs. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.
Quick Start
This section provides a quick overview of the VideoMAE model and how to use it.
Features
- Video Extension: VideoMAE extends Masked Autoencoders (MAE) to video, with an architecture similar to a standard Vision Transformer (ViT).
- Self-Supervised Learning: Through pre-training, it learns an inner representation of videos for downstream tasks.
- Feature Extraction: Useful for extracting features for tasks like video classification.
Documentation
Model description
VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches.
Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and fixed sine/cosine position embeddings are added before feeding the sequence to the layers of the Transformer encoder.
Through pre-training, the model learns an inner representation of videos that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled videos, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire video. A sketch of this setup is shown below.
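As an illustration, here is a minimal sketch of placing a linear classification head on top of the pre-trained encoder. It mean-pools the last hidden state into a single video representation rather than relying on a dedicated token, and the head size (`num_labels = 400`, as in Kinetics-400) is an assumption for the example rather than part of this checkpoint:

```python
import numpy as np
import torch
from torch import nn
from transformers import VideoMAEImageProcessor, VideoMAEModel

# Load the pre-trained encoder (without the pre-training decoder)
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# Hypothetical classification head: 400 classes, as in Kinetics-400
num_labels = 400
classifier = nn.Linear(model.config.hidden_size, num_labels)

# Dummy clip of 16 frames; replace with real, decoded video frames
video = list(np.random.randn(16, 3, 224, 224))
pixel_values = processor(video, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values)

# Mean-pool the patch embeddings into a single video representation
video_embedding = outputs.last_hidden_state.mean(dim=1)
logits = classifier(video_embedding)  # shape: (batch_size, num_labels)
```

In practice the head and the encoder are usually trained (or fine-tuned) jointly on the labeled videos.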
Intended uses & limitations
You can use the raw model for predicting pixel values for masked patches of a video, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
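For example, a fine-tuned classification checkpoint can be used for inference roughly as follows. The checkpoint name `MCG-NJU/videomae-base-finetuned-kinetics` is used here for illustration, assuming such a fine-tuned version is available on the hub:

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Checkpoint name assumed for illustration; pick any fine-tuned VideoMAE model from the hub
ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# Dummy clip of 16 frames; replace with real, decoded video frames
video = list(np.random.randn(16, 3, 224, 224))
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```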
How to use
Here is how to use this model to predict pixel values for randomly masked patches:
```python
from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
import numpy as np
import torch

num_frames = 16
# Dummy clip of 16 frames; replace with real, decoded video frames
video = list(np.random.randn(num_frames, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

pixel_values = processor(video, return_tensors="pt").pixel_values

# Each 16x16 patch spans `tubelet_size` consecutive frames, so the sequence length is
# (num_frames / tubelet_size) * (number of patches per frame)
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame

# Randomly mask roughly half of the positions; the loss is computed on the masked patches
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss = outputs.loss
```
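Note that the random 0/1 mask above hides roughly half of the positions, whereas the VideoMAE paper pre-trains with a much higher masking ratio (around 90%). A minimal sketch of building a mask with a fixed ratio, assuming a target of 0.9 and masking positions uniformly at random rather than with the paper's tube masking strategy, could look like this:

```python
import torch

seq_length = 1568  # (16 frames / tubelet size 2) * 14 * 14 patches for this configuration
mask_ratio = 0.9   # assumed ratio, in line with the high masking ratios used in the paper
num_masked = int(mask_ratio * seq_length)

# Mask a fixed number of randomly chosen positions
bool_masked_pos = torch.zeros(1, seq_length, dtype=torch.bool)
bool_masked_pos[:, torch.randperm(seq_length)[:num_masked]] = True
```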
For more code examples, we refer to the documentation.
License
This model is licensed under the "cc-by-nc-4.0" license.
BibTeX entry and citation info
```bibtex
@misc{https://doi.org/10.48550/arxiv.2203.12602,
  doi = {10.48550/ARXIV.2203.12602},
  url = {https://arxiv.org/abs/2203.12602},
  author = {Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```