VideoMAE Open-source Video Classification Model - Free Deployment for Precise Video Classification Tasks

Videomae Base Finetuned Ssv2

Developed by MCG-NJU

VideoMAE is a video self-supervised pretraining model based on Masked Autoencoder (MAE), fine-tuned on the Something-Something-v2 dataset for video classification tasks.

Video Processing

Transformers

#Video Self-Supervised Learning #Action Recognition #Spatio-Temporal Feature Extraction

Release Time: 8/2/2022

Model Overview

This model is pretrained in a self-supervised manner and fine-tuned in a supervised way on the Something-Something-v2 dataset, primarily for video classification tasks.

Model Features

Self-Supervised Pretraining

Uses the Masked Autoencoder (MAE) method for self-supervised video pretraining, which reduces reliance on labeled data.

Efficient Video Representation Learning

Learns internal video representations through masking and reconstruction mechanisms, effectively extracting video features

Transformer Architecture

Based on the Vision Transformer architecture, which processes videos as sequences of fixed-size patches.

Model Capabilities

Video Classification

Video Feature Extraction

Use Cases

Video Understanding

Action Recognition

Recognizing human actions and behaviors in videos

Achieves 70.6% top-1 accuracy on the Something-Something-v2 test set.

🚀 VideoMAE (base-sized model, fine-tuned on Something-Something-v2)

VideoMAE is a model pre-trained in a self-supervised way and fine-tuned on Something-Something-v2. It can be used for video classification tasks.

🚀 Quick Start

The VideoMAE model was pre-trained for 2400 epochs in a self-supervised way and fine-tuned in a supervised way on Something-Something-v2. It was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.

Disclaimer: The team releasing VideoMAE did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches.

Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is also added to the beginning of the sequence for use in classification tasks. Fixed sine/cosine position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
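
As a rough illustration of the paragraph above, the sketch below counts the patch tokens produced for one clip and builds a small fixed sine/cosine position table. The 16-frame, 224x224 clip matches the usage example on this page and the 16x16 patch size comes from the text. The tubelet size of 2 (two consecutive frames per patch) and the helper names `num_patch_tokens` and `sinusoid_table` are assumptions made for illustration, not values or functions taken from this model card.

```python
import math


def num_patch_tokens(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count patch tokens for one clip (special tokens are not included).

    tubelet_size=2 is an assumption: each patch is taken to span 2 frames.
    """
    patches_per_side = image_size // patch_size
    temporal_slots = num_frames // tubelet_size
    return temporal_slots * patches_per_side * patches_per_side


def sinusoid_table(num_positions, dim):
    """Fixed position table: sine on even channels, cosine on odd channels."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(dim):
            angle = pos / (10000 ** (2 * (j // 2) / dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


tokens = num_patch_tokens()
# A tiny embedding width (8) keeps the printed row readable.
table = sinusoid_table(tokens, 8)

print(tokens)    # 1568 = (16 / 2) * 14 * 14
print(table[0])  # position 0: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because the table depends only on position and channel index, it can be computed once and added to the patch embeddings without any learned parameters.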

By pre-training the model, it learns an inner representation of videos that can then be used to extract features useful for downstream tasks. For example, if you have a dataset of labeled videos, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. Usually, a linear layer is placed on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.
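
The linear-layer idea described above can be sketched as follows. This is a minimal illustration rather than the training setup used for this model: a random tensor stands in for the encoder's last hidden state so the snippet runs without downloading weights, and the sizes (1568 tokens, hidden size 768, 10 classes) are assumed values chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed sizes for illustration only.
batch_size, seq_len, hidden_size = 2, 1568, 768
num_classes = 10  # hypothetical number of labels in your own dataset

# Stand-in for the encoder output, shape (batch, sequence_length, hidden_size).
# In practice these features would come from the pre-trained encoder.
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Linear classification head applied to the first token of each sequence,
# following the description in the text.
head = nn.Linear(hidden_size, num_classes)
video_repr = last_hidden_state[:, 0]  # shape (batch, hidden_size)
logits = head(video_repr)             # shape (batch, num_classes)

print(logits.shape)
```

Only `head` would need gradient updates in a linear-probe setting. The encoder that produces `last_hidden_state` can stay frozen.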

💻 Usage Examples

Basic Usage

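from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip: 16 random frames, each of shape (channels, height, width).
video = list(np.random.randn(16, 3, 224, 224))

# Load the image processor and the fine-tuned classification model.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-ssv2")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-ssv2")

# Preprocess the frames into a batch of PyTorch tensors.
inputs = processor(video, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Map the highest-scoring logit to its class name.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])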

For more code examples, see the documentation.

📚 Documentation

You can use the raw model to classify a video into one of the 174 possible Something-Something-v2 labels.

🔧 Technical Details

This model obtains a top-1 accuracy of 70.6% and a top-5 accuracy of 92.6% on the test set of Something-Something-v2.
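
For readers unfamiliar with the two metrics, here is a minimal sketch of how top-1 and top-5 accuracy can be computed from classifier logits. The helper name `topk_accuracy` and the toy logits are made up for illustration. They are not output from this model and not the evaluation code behind the reported numbers.

```python
def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(logits, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy logits for 4 clips over 6 classes (made-up numbers).
logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0, 0.5],  # best class: 1
    [1.5, 0.2, 1.4, 0.0, 0.1, 0.3],   # best class: 0, runner-up: 2
    [0.0, 0.1, 0.2, 0.3, 0.4, 3.0],   # best class: 5
    [2.0, 1.9, 1.8, 1.7, 1.6, 0.0],   # class 5 is ranked last
]
labels = [1, 2, 5, 5]

top1 = topk_accuracy(logits, labels, k=1)
top5 = topk_accuracy(logits, labels, k=5)
print(top1, top5)  # 0.5 0.75
```

Top-5 accuracy is never lower than top-1 accuracy, since a prediction that is correct at rank 1 is also within the top 5.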

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2203.12602,
  doi = {10.48550/ARXIV.2203.12602},
  url = {https://arxiv.org/abs/2203.12602},
  author = {Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

📄 License

This model is licensed under "cc-by-nc-4.0".

Property	Details
License	cc-by-nc-4.0
Tags	vision, video-classification
