VideoMAE Open-Source Video Model - Free Deployment to Boost Video Classification Task Applications

Videomae Small Finetuned Kinetics

Developed by MCG-NJU

VideoMAE is a masked autoencoder model for video, pretrained with self-supervision and fine-tuned on the Kinetics-400 dataset, suitable for video classification tasks.

Video Processing

Transformers

#Video Classification #Masked Autoencoding #Self-supervised Learning

Downloads 2,152

Release Time : 4/16/2023

Model Overview

This model is based on a masked autoencoder architecture, specifically designed for video classification tasks, capable of recognizing 400 action categories in the Kinetics-400 dataset.

Model Features

Self-supervised Pretraining

Learns internal video representations through 1600 epochs of self-supervised pretraining.

Efficient Video Classification

After fine-tuning on the Kinetics-400 dataset, it can accurately recognize 400 action categories.

Masked Autoencoder Architecture

Uses a masked autoencoder approach for video pretraining, improving data efficiency.

Model Capabilities

Video Classification

Action Recognition

Video Feature Extraction

Use Cases

Video Content Analysis

Action Recognition

Recognize human actions in videos

Achieves 79.0% top-1 accuracy on the Kinetics-400 test set

Video Classification

Classify videos into 400 predefined categories

Achieves 93.8% top-5 accuracy on the Kinetics-400 test set

🚀 VideoMAE (small-sized model, fine-tuned on Kinetics-400)

A VideoMAE model pre-trained self-supervised for 1600 epochs and fine-tuned on Kinetics-400 in a supervised way.

🚀 Quick Start

The VideoMAE model was pre-trained for 1600 epochs in a self-supervised manner and then fine-tuned on Kinetics-400 in a supervised way. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in [this repository](https://github.com/MCG-NJU/VideoMAE).

Disclaimer: The team releasing VideoMAE did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

Model description

VideoMAE is an extension of Masked Autoencoders (MAE) to video. The model's architecture is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches.
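To make the masking idea concrete, here is a minimal sketch of tube masking, the strategy described in the VideoMAE paper, in which the same spatial patches are hidden in every temporal slice. The function name `tube_mask`, the 8 x 196 token grid, and the 0.9 mask ratio are illustrative assumptions, not values stated in this model card, and this is not the authors' implementation.

```python
import random


def tube_mask(num_temporal, num_spatial, mask_ratio, seed=0):
    """Return a num_temporal x num_spatial grid of booleans (True = masked).

    Hypothetical helper: one set of spatial positions is drawn at random and
    reused for every temporal slice, so each masked patch forms a "tube"
    through time.
    """
    rng = random.Random(seed)
    num_masked = int(mask_ratio * num_spatial)
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    slice_mask = [pos in masked_positions for pos in range(num_spatial)]
    # Copy the same spatial mask into every temporal slice.
    return [list(slice_mask) for _ in range(num_temporal)]


# Assumed grid: 8 temporal slices of 14 x 14 = 196 patches, 90% masked.
mask = tube_mask(num_temporal=8, num_spatial=196, mask_ratio=0.9)
visible_per_slice = mask[0].count(False)
print("Visible patches per slice:", visible_per_slice)
```

During pre-training only the visible patches would be passed to the encoder, and the decoder would be asked to reconstruct the pixels of the masked ones.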

Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks. Fixed sine/cosine position embeddings are also added before the sequence is fed to the layers of the Transformer encoder.
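The following sketch illustrates the two steps described above: counting the patch tokens produced for one clip, and building a fixed sine/cosine position table. The helper names are hypothetical, the tubelet size of 2 frames per patch is an assumption not stated in this model card, and the table follows the standard Transformer sinusoid formula rather than code taken from the VideoMAE repository.

```python
import math


def num_patch_tokens(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count patch tokens for one clip (excludes any extra classification token).

    tubelet_size is an assumed value: each token is taken to cover
    tubelet_size consecutive frames of one 16x16 spatial patch.
    """
    patches_per_side = image_size // patch_size
    return (num_frames // tubelet_size) * patches_per_side ** 2


def sincos_position_table(num_positions, dim):
    """Fixed position embeddings: sine on even channels, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


tokens = num_patch_tokens()
print("Patch tokens per clip:", tokens)

# Small table for illustration; a real model would use its hidden size as dim.
table = sincos_position_table(num_positions=4, dim=8)
print("Embedding of position 0:", table[0])
```

Because the table depends only on the position index and the channel index, it can be computed once and added to the patch embeddings without any learned parameters.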

Through pre-training, the model learns an inner representation of videos, which can be used to extract features for downstream tasks. For example, if you have a labeled video dataset, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. Usually, a linear layer is placed on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.

Intended uses & limitations

You can use the raw model to classify a video into one of the 400 possible Kinetics-400 labels.

💻 Usage Examples

Basic Usage

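from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip: 16 frames, each a random array of shape (channels, height, width).
video = list(np.random.randn(16, 3, 224, 224))

# Load the preprocessing configuration and the fine-tuned classifier.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-small-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-small-finetuned-kinetics")

# Preprocess the frames and return PyTorch tensors.
inputs = processor(video, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Map the highest-scoring logit to its Kinetics-400 label.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])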
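The example above feeds 16 random frames to the model. With a real video you would first choose which 16 frames to decode. The sketch below shows one simple way to do that, by spacing the indices evenly across the clip. The helper `uniform_frame_indices` and the 300-frame clip length are hypothetical, and this is not necessarily the sampling scheme used to train or evaluate the model.

```python
def uniform_frame_indices(total_frames, clip_len=16):
    """Pick clip_len frame indices spread evenly over a video.

    Hypothetical helper: indices run from the first to the last frame.
    If the video has fewer than clip_len frames, some indices repeat.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if clip_len == 1:
        return [0]
    step = (total_frames - 1) / (clip_len - 1)
    return [round(i * step) for i in range(clip_len)]


# Assumed example: a 300-frame video reduced to the 16 frames the model expects.
indices = uniform_frame_indices(total_frames=300, clip_len=16)
print(indices)
```

The selected frames can then be decoded with a video library of your choice and passed to the processor as a list, in place of the random array.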

For more code examples, see the documentation.

📚 Documentation

Evaluation results

This model obtains a top-1 accuracy of 79.0% and a top-5 accuracy of 93.8% on the test set of Kinetics-400.
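For readers unfamiliar with these metrics, the sketch below shows how top-1 and top-5 style accuracy can be computed from per-video logits. The function `top_k_accuracy` and the toy three-class data are hypothetical, and the sketch does not reproduce the evaluation protocol behind the reported numbers.

```python
def top_k_accuracy(logits_per_video, labels, k):
    """Percentage of videos whose true label is among the k highest logits."""
    hits = 0
    for logits, label in zip(logits_per_video, labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data: 3 videos, 3 classes (the real model outputs 400 logits per video).
toy_logits = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

print("top-1:", top_k_accuracy(toy_logits, toy_labels, k=1))
print("top-2:", top_k_accuracy(toy_logits, toy_labels, k=2))
```

Top-5 accuracy is always at least as high as top-1, since a prediction that is correct at rank 1 is also within the first five ranks.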

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2203.12602,
  doi = {10.48550/ARXIV.2203.12602},
  url = {https://arxiv.org/abs/2203.12602},
  author = {Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

📄 License

This model is licensed under cc-by-nc-4.0.

Property	Details
License	cc-by-nc-4.0
Tags	vision, video-classification
