Model Card for videomae-base-finetuned-ucf101
This model card provides detailed information about the videomae-base-finetuned-ucf101
model, which is a VideoMAE Base model fine-tuned on the UCF101 dataset. It includes details on model usage, training, evaluation, and more.
🚀 Quick Start
Use the code below to get started with the model.
```python
from decord import VideoReader, cpu
import torch
import numpy as np
# Note: recent transformers versions supersede VideoMAEFeatureExtractor
# with VideoMAEImageProcessor; the class below matches the original example.
from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Pick a random window of clip_len * frame_sample_rate frames,
    # then take clip_len evenly spaced frame indices from it.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# Download an example clip and decode 16 frames from it.
file_path = hf_hub_download(
    repo_id="nateraw/dino-clips", filename="archery.mp4", repo_type="space"
)
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))
videoreader.seek(0)
indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()

# Load the preprocessor and the fine-tuned classification model.
feature_extractor = VideoMAEFeatureExtractor.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
model = VideoMAEForVideoClassification.from_pretrained("nateraw/videomae-base-finetuned-ucf101")

# Run inference and map the top logit to its UCF101 class name.
inputs = feature_extractor(list(video), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
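To inspect more than the single best class, the logits can be converted to probabilities and the top entries listed. A minimal sketch continuing from the example above, using standard PyTorch `softmax` and `topk`:

```python
# Continuing from the Quick Start example: list the five most likely
# UCF101 classes with their softmax probabilities.
probs = logits.softmax(-1)
top5 = probs.topk(5, dim=-1)
for p, idx in zip(top5.values[0].tolist(), top5.indices[0].tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```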
✨ Features
Direct Use
This model can be used for video action recognition over the 101 action classes of UCF101.
📦 Installation
The original README does not list installation steps. The Quick Start example imports transformers, decord, torch, numpy, and huggingface_hub, so a minimal environment can be set up as sketched below.
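```bash
# Minimal dependency set inferred from the Quick Start imports
# (not an official requirements list from the original README).
pip install transformers torch numpy decord huggingface_hub
```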
💻 Usage Examples
Basic Usage
The basic usage flow is identical to the Quick Start example above: sample 16 frames from a video, preprocess them with the feature extractor, and run a single forward pass to obtain class logits.
📚 Documentation
Model Details
Model Description
VideoMAE Base model fine-tuned on UCF101.
- Developed by: @nateraw
- Model type: fine-tuned
- Language(s) (NLP): en
- License: mit
- Parent Model: MCG-NJU/videomae-base
- Resources for more information: [More Information Needed]
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing
We sampled 64-frame clips from the videos, then took a uniform subsample of those frames to obtain the 16-frame inputs the model expects. During training, we used PyTorchVideo's MixVideo to apply mixup/cutmix augmentation.
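As an illustration of the subsampling step described above, the sketch below (names and shapes are illustrative, not taken from the original training code) uniformly selects 16 of 64 frames:

```python
import numpy as np

# Illustrative sketch of the preprocessing described above: given a
# sampled 64-frame clip, take a uniform subsample of 16 frames.
def uniform_subsample(clip_frames: np.ndarray, num_out: int = 16) -> np.ndarray:
    total = clip_frames.shape[0]  # e.g. 64
    indices = np.linspace(0, total - 1, num=num_out).astype(np.int64)
    return clip_frames[indices]

clip = np.zeros((64, 224, 224, 3), dtype=np.uint8)  # placeholder 64-frame clip
model_input = uniform_subsample(clip)                # shape (16, 224, 224, 3)
```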
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
We trained and evaluated on only one fold of the UCF101 annotations. Unlike the VideoMAE paper, we did not run inference over multiple crops/segments of the validation videos, so these results are likely slightly lower than what multi-view evaluation would yield.
- Eval Accuracy: 0.758209764957428
- Eval Accuracy Top 5: 0.8983050584793091
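For reference, top-5 accuracy of the kind reported above can be computed from model logits and ground-truth labels as in this generic sketch (not the original evaluation script):

```python
import torch

# Generic top-5 accuracy over a batch of logits and integer labels;
# a sketch of the metric reported above, not the original eval code.
def top5_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    top5 = logits.topk(5, dim=-1).indices             # (batch, 5)
    correct = (top5 == labels.unsqueeze(-1)).any(-1)  # (batch,)
    return correct.float().mean().item()
```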
Bias, Risks, and Limitations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Model Card Authors
@nateraw
Model Card Contact
@nateraw
🔧 Technical Details
No technical details are provided in the original README.
📄 License
This model is licensed under the MIT license.