VideoMAE-small-finetuned-ssv2 Open-Source Video Classification Model - Free Deployment for Precise Video Category Identification

VideoMAE Small Finetuned SSV2

Developed by MCG-NJU

VideoMAE is a self-supervised pretrained video model based on Masked Autoencoder (MAE), fine-tuned on the Something-Something V2 dataset for video classification tasks.

Video Processing

Transformers

#Video Action Recognition #Self-supervised Pretraining #SSV2 Dataset


Release Time: 4/16/2023

Model Overview

This model was pretrained in a self-supervised manner for 2400 epochs and then fine-tuned with supervision on the Something-Something V2 dataset. It classifies a video into one of 174 labels.

Model Features

Self-supervised Pretraining

Uses the Masked Autoencoder (MAE) method for self-supervised pretraining to learn internal video representations

Efficient Video Processing

Splits videos into sequences of fixed-size patches, which the Transformer architecture then processes

SSV2 Dataset Fine-tuning

Fine-tuned on the Something-Something V2 dataset for action recognition tasks

Model Capabilities

Video Classification

Action Recognition

Feature Extraction

Use Cases

Video Understanding

Action Recognition

Identify human actions and behaviors in videos

Achieves 66.8% top-1 accuracy on the SSV2 test set

Video Content Analysis

Analyze video content and classify it automatically

🚀 VideoMAE (small-sized model, fine-tuned on SSV2)

A VideoMAE model pre-trained for 2400 epochs in a self-supervised way and fine-tuned in a supervised way on Something-Something V2.

This model was introduced in the paper VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Tong et al. and first released in this repository.

Disclaimer: The team releasing VideoMAE did not write a model card for this model, so this model card has been written by the Hugging Face team.

🚀 Quick Start

You can use the raw model for video classification into one of the 174 possible Something-Something V2 labels.

✨ Features

VideoMAE is an extension of Masked Autoencoders (MAE) to video. The architecture of the model is very similar to that of a standard Vision Transformer (ViT), with a decoder on top for predicting pixel values for masked patches.

Videos are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is also added to the beginning of the sequence for use in classification tasks. Fixed sine/cosine position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
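To make the patch sequence concrete, here is a minimal sketch that counts the tokens produced for one clip and builds a fixed sine/cosine position table. It is an illustration, not the library's implementation: the function names are hypothetical, and `tubelet_size=2` (frames grouped per patch) and `dim=384` (embedding width) are assumptions that the text above does not state.

```python
import math

def num_tokens(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count patch tokens for one clip.

    tubelet_size is an assumption; the text above only gives the
    16x16 spatial patch resolution.
    """
    patches_per_frame = (image_size // patch_size) ** 2
    return (num_frames // tubelet_size) * patches_per_frame

def sinusoid_table(n_positions, dim):
    """Fixed sine/cosine position embeddings, one row per token position."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(dim):
            # Each (sin, cos) pair of channels shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

tokens = num_tokens()
table = sinusoid_table(tokens, dim=384)  # dim=384 is an assumed width
print(tokens, len(table), len(table[0]))  # prints: 1568 1568 384
```

Because the table depends only on position and width, it can be computed once and added to the patch embeddings of every clip.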

Through pre-training, the model learns an inner representation of videos that can then be used to extract features useful for downstream tasks. For example, if you have a dataset of labeled videos, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. Typically, a linear layer is placed on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire video.
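The linear-classifier idea can be sketched in plain Python. This is a toy illustration under stated assumptions: `clip_representation` and `linear_head` are hypothetical helpers, the hidden states and weights are random, row 0 is assumed to be the [CLS] state as described above, and mean pooling is offered only as an alternative way to get a clip-level vector.

```python
import random

def clip_representation(hidden_states, pooling="cls"):
    """Reduce per-token hidden states to a single clip-level vector."""
    if pooling == "cls":
        # Assumes row 0 is the [CLS] token, as described in the text above.
        return list(hidden_states[0])
    if pooling == "mean":
        n = len(hidden_states)
        return [sum(column) / n for column in zip(*hidden_states)]
    raise ValueError(f"unknown pooling: {pooling}")

def linear_head(features, weights, bias):
    """One linear layer: logits[c] = dot(weights[c], features) + bias[c]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

random.seed(0)
hidden_size, num_labels = 8, 174  # toy width; 174 labels as in SSV2
hidden_states = [[random.gauss(0, 1) for _ in range(hidden_size)]
                 for _ in range(5)]
weights = [[random.gauss(0, 0.02) for _ in range(hidden_size)]
           for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(clip_representation(hidden_states), weights, bias)
predicted = max(range(num_labels), key=lambda c: logits[c])
print(len(logits), predicted)
```

In practice only `weights` and `bias` would be trained when the encoder is kept frozen, which is what makes this a cheap way to reuse the pre-trained features.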

💻 Usage Examples

Basic Usage

Here is how to use this model to classify a video:

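from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip: 16 frames of random values, each shaped (channels, height, width).
# Replace this with 16 frames sampled from a real video.
video = list(np.random.randn(16, 3, 224, 224))

# VideoMAEImageProcessor supersedes the deprecated VideoMAEFeatureExtractor.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-small-finetuned-ssv2")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-small-finetuned-ssv2")

# Preprocess the frames and batch them into PyTorch tensors.
inputs = processor(video, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Map the highest-scoring logit to its Something-Something V2 label.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])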

For more code examples, refer to the documentation.
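The example above feeds random data to the model. For a real video, 16 frames have to be chosen from the clip first. One simple option, shown here as a sketch, is to take evenly spaced frames; `sample_frame_indices` is a hypothetical helper, and decoding the frames themselves is left to whichever video library you use.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Return num_frames evenly spaced frame indices for a clip.

    Each index is the midpoint of one of num_frames equal segments.
    Clips shorter than num_frames yield repeated indices.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    step = total_frames / num_frames
    return [min(total_frames - 1, int((i + 0.5) * step))
            for i in range(num_frames)]

indices = sample_frame_indices(64)
print(indices)
# prints: [2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62]
```

The frames at these indices, collected into a list, can take the place of the random `video` list in the example above.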

📚 Documentation

Evaluation results

This model obtains a top-1 accuracy of 66.8% and a top-5 accuracy of 90.3% on the test set of Something-Something V2.
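For readers unfamiliar with the metric, top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below shows the computation on made-up logits; `top_k_accuracy` is a hypothetical helper and the numbers are toy data, not results from this model.

```python
def top_k_accuracy(logits_batch, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for logits, label in zip(logits_batch, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(logits)), key=lambda c: logits[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy logits for three clips over four classes (illustrative values only).
logits_batch = [
    [0.1, 2.0, 0.3, 0.0],
    [1.5, 0.2, 1.4, 0.1],
    [0.0, 0.1, 0.2, 3.0],
]
labels = [1, 2, 0]

top1 = top_k_accuracy(logits_batch, labels, k=1)
top2 = top_k_accuracy(logits_batch, labels, k=2)
print(round(top1, 3), round(top2, 3))
```

Top-5 accuracy is always at least as high as top-1, since every top-1 hit is also a top-5 hit.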

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2203.12602,
  doi = {10.48550/ARXIV.2203.12602},
  url = {https://arxiv.org/abs/2203.12602},
  author = {Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

📄 License

This model is licensed under "cc-by-nc-4.0".
