VideoMAE Open-Source Video Model - Free Deployment for Precise and Efficient Video Classification!

VideoMAE Base Finetuned Kinetics

Developed by MCG-NJU

VideoMAE is a video self-supervised pre-training model based on Masked Autoencoder (MAE), fine-tuned on the Kinetics-400 dataset for video classification tasks.

Video Processing

Transformers

#Video Self-supervised Learning #Action Recognition #Masked Autoencoding

Downloads 44.91k

Release Time: 7/8/2022

Model Overview

This model is pre-trained in a self-supervised manner and then fine-tuned with supervision on the Kinetics-400 dataset. It classifies a video into one of 400 possible categories.

Model Features

Self-supervised Pre-training

Uses the Masked Autoencoder (MAE) method for self-supervised pre-training to learn internal video representations

Efficient Video Representation

By predicting pixel values of masked video patches, the model learns effective video feature representations

Transformer Architecture

Based on the Vision Transformer architecture; processes sequences of video patches, which suits temporal video modeling

Model Capabilities

Video Classification

Video Feature Extraction

Use Cases

Video Understanding

Kinetics-400 Video Classification

Classify videos into 400 categories from the Kinetics-400 dataset

Achieves 80.9% top-1 accuracy and 94.7% top-5 accuracy on the Kinetics-400 test set

🚀 VideoMAE (base-sized model, fine-tuned on Kinetics-400)

VideoMAE is a pre-trained model for video classification. It was pre-trained in a self-supervised way and fine-tuned on Kinetics-400. It offers an effective solution for video-related downstream tasks.

🚀 Quick Start

You can use the raw model to classify a video into one of the 400 possible Kinetics-400 labels. Here is how to use this model to classify a video:

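from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip: 16 frames, each of shape (channels, height, width) = (3, 224, 224).
# Replace this with real decoded frames for actual use.
video = list(np.random.randn(16, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

# Preprocess the frames and return PyTorch tensors.
inputs = processor(video, return_tensors="pt")

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# The index of the largest logit is the predicted Kinetics-400 class.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])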

For more code examples, see the documentation.
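The Quick Start snippet reports only the single most likely class. As a minimal sketch of how the five most likely labels could be read from a logits vector, the block below uses random stand-in logits and placeholder labels so it runs without downloading weights. The helper name `top_k_predictions` and the `class_N` labels are hypothetical and not part of the library.

```python
import numpy as np

def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs for one logits vector."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax: subtract the max before exponentiating.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Indices of the k largest probabilities, highest first.
    top_idx = np.argsort(probs)[::-1][:k]
    return [(id2label[int(i)], float(probs[i])) for i in top_idx]

# Stand-in data (assumption): random logits and placeholder labels for 400 classes.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=400)
fake_id2label = {i: f"class_{i}" for i in range(400)}

top5 = top_k_predictions(fake_logits, fake_id2label, k=5)
for label, prob in top5:
    print(f"{label}: {prob:.4f}")
```

With the real model, the same helper should work when given `logits[0].numpy()` and `model.config.id2label` from the Quick Start code.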

✨ Features

VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches.
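To make the masking step concrete, here is a rough sketch of how a mask over video patches might be drawn. It assumes "tube" masking (the same spatial positions hidden at every time step) and a 90% masking ratio, as described in the VideoMAE paper; neither detail is stated in this card. The function name `make_tube_mask` and the 8 x 196 token layout are illustrative assumptions.

```python
import numpy as np

def make_tube_mask(num_time_steps, tokens_per_step, mask_ratio=0.9, seed=0):
    """Boolean mask of shape (num_time_steps, tokens_per_step); True = masked."""
    rng = np.random.default_rng(seed)
    num_masked = int(mask_ratio * tokens_per_step)
    # Pick the spatial positions to hide once...
    masked_positions = rng.choice(tokens_per_step, size=num_masked, replace=False)
    step_mask = np.zeros(tokens_per_step, dtype=bool)
    step_mask[masked_positions] = True
    # ...and reuse the same spatial pattern at every time step (a "tube").
    return np.tile(step_mask, (num_time_steps, 1))

# Assumed layout: 8 temporal steps of 14 x 14 = 196 patches each.
mask = make_tube_mask(num_time_steps=8, tokens_per_step=196)
visible = int((~mask).sum())
print(f"masked tokens: {int(mask.sum())}, visible tokens: {visible}")
```

During pre-training, only the visible tokens would be passed to the encoder, and the decoder would be trained to reconstruct the pixels of the masked ones.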

Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and fixed sine/cosine position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
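The fixed sine/cosine position embeddings can be illustrated with a short sketch. It assumes the standard Transformer formulation (base 10000, sine on even channels, cosine on odd channels), a hidden size of 768 for the base model, and patches that span 2 frames in time; these values are assumptions not given in this card, and `sinusoid_table` is a hypothetical helper name.

```python
import numpy as np

def sinusoid_table(num_positions, dim):
    """Fixed sine/cosine position table of shape (num_positions, dim)."""
    positions = np.arange(num_positions, dtype=np.float64)[:, None]  # (N, 1)
    channel = np.arange(dim)[None, :]                                # (1, D)
    # Each channel pair (2i, 2i+1) shares one frequency.
    angles = positions / np.power(10000.0, 2 * (channel // 2) / dim)
    table = np.empty((num_positions, dim))
    table[:, 0::2] = np.sin(angles[:, 0::2])  # even channels: sine
    table[:, 1::2] = np.cos(angles[:, 1::2])  # odd channels: cosine
    return table

# Patch tokens for a 16-frame 224x224 clip with 16x16 patches,
# assuming each patch covers 2 consecutive frames.
num_frames, frame_size, patch_size, frames_per_patch = 16, 224, 16, 2
num_tokens = (num_frames // frames_per_patch) * (frame_size // patch_size) ** 2

table = sinusoid_table(num_tokens, dim=768)
print(num_tokens, table.shape)
```

Because the table depends only on position and channel index, it is computed once and not updated during training.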

Through pre-training, the model learns an inner representation of videos that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled videos, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.
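As a rough illustration of placing a linear layer on top of the encoder output, the sketch below applies a randomly initialized linear head to the first token's last hidden state. The hidden states are random stand-ins rather than real encoder output, and the shapes (hidden size 768, 10 tokens, 400 classes) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the encoder's last hidden states: (batch, tokens, hidden).
batch_size, num_tokens, hidden_size, num_classes = 2, 10, 768, 400
last_hidden_state = rng.normal(size=(batch_size, num_tokens, hidden_size))

# Use the first token's hidden state as the representation of the whole video.
video_repr = last_hidden_state[:, 0, :]              # (batch, hidden)

# A linear classification head: logits = x @ W + b.
weight = rng.normal(scale=0.02, size=(hidden_size, num_classes))
bias = np.zeros(num_classes)
head_logits = video_repr @ weight + bias             # (batch, num_classes)

print(head_logits.shape)
```

In practice only `weight` and `bias` would be trained on the labeled videos when the encoder is kept frozen.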

📚 Documentation

Evaluation results

This model obtains a top-1 accuracy of 80.9% and a top-5 accuracy of 94.7% on the test set of Kinetics-400.
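For readers unfamiliar with the two metrics, here is a minimal sketch of how top-k accuracy is usually computed: a prediction counts as correct if the true label is among the k highest-scoring classes. The tiny score matrix is made up for illustration and is unrelated to the reported results; `top_k_accuracy` is a hypothetical helper name.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label is among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argsort is ascending, so the last k columns hold the k largest scores.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy scores for 3 videos over 4 classes (made-up values).
scores = np.array([
    [0.10, 0.70, 0.20, 0.00],  # true label 1: highest score, top-1 hit
    [0.50, 0.10, 0.30, 0.05],  # true label 2: second highest, top-2 hit only
    [0.25, 0.15, 0.10, 0.50],  # true label 1: third highest, miss for k <= 2
])
labels = np.array([1, 2, 1])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```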

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2203.12602,
  doi = {10.48550/ARXIV.2203.12602},
  url = {https://arxiv.org/abs/2203.12602},
  author = {Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

📄 License

This model is licensed under "cc-by-nc-4.0".
