Model Card for videomae-base-finetuned-ucf101
This model card provides detailed information about the videomae-base-finetuned-ucf101
model, which is a VideoMAE Base model fine-tuned on the UCF101 dataset. It includes details on model usage, training, evaluation, and more.
🚀 Quick Start
Use the code below to get started with the model.
```python
from decord import VideoReader, cpu
import torch
import numpy as np
# Note: recent transformers versions supersede VideoMAEFeatureExtractor
# with VideoMAEImageProcessor; the class below matches the original example.
from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Pick a random window of clip_len * frame_sample_rate frames,
    # then take clip_len evenly spaced frame indices from it.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# Download an example clip and decode 16 frames from it.
file_path = hf_hub_download(
    repo_id="nateraw/dino-clips", filename="archery.mp4", repo_type="space"
)
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))
videoreader.seek(0)
indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()

# Load the preprocessor and the fine-tuned classification model.
feature_extractor = VideoMAEFeatureExtractor.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
model = VideoMAEForVideoClassification.from_pretrained("nateraw/videomae-base-finetuned-ucf101")

# Run inference and map the top logit to its UCF101 class name.
inputs = feature_extractor(list(video), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
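To inspect more than the single best class, the logits can be converted to probabilities and the top entries listed. A minimal sketch continuing from the example above, using standard PyTorch `softmax` and `topk`:

```python
# Continuing from the Quick Start example: list the five most likely
# UCF101 classes with their softmax probabilities.
probs = logits.softmax(-1)
top5 = probs.topk(5, dim=-1)
for p, idx in zip(top5.values[0].tolist(), top5.indices[0].tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```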
✨ Features
Direct Use
This model can be used for video action recognition over the 101 action classes of UCF101.
📦 Installation
The original README does not list installation steps. The Quick Start example imports transformers, decord, torch, numpy, and huggingface_hub, so a minimal environment can be set up as sketched below.
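```bash
# Minimal dependency set inferred from the Quick Start imports
# (not an official requirements list from the original README).
pip install transformers torch numpy decord huggingface_hub
```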
💻 Usage Examples
Basic Usage
The basic usage flow is identical to the Quick Start example above: sample 16 frames from a video, preprocess them with the feature extractor, and run a single forward pass to obtain class logits.
📚 Documentation
Model Details
Model Description
VideoMAE Base model fine-tuned on UCF101.
- Developed by: @nateraw
- Model type: fine-tuned
- Language(s) (NLP): en
- License: mit
- Parent Model: MCG-NJU/videomae-base
- Resources for more information: [More Information Needed]
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing
We sampled 64-frame clips from the videos, then took a uniform subsample of those frames to obtain the 16-frame inputs the model expects. During training, we used PyTorchVideo's MixVideo to apply mixup/cutmix augmentation.
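As an illustration of the subsampling step described above, the sketch below (names and shapes are illustrative, not taken from the original training code) uniformly selects 16 of 64 frames:

```python
import numpy as np

# Illustrative sketch of the preprocessing described above: given a
# sampled 64-frame clip, take a uniform subsample of 16 frames.
def uniform_subsample(clip_frames: np.ndarray, num_out: int = 16) -> np.ndarray:
    total = clip_frames.shape[0]  # e.g. 64
    indices = np.linspace(0, total - 1, num=num_out).astype(np.int64)
    return clip_frames[indices]

clip = np.zeros((64, 224, 224, 3), dtype=np.uint8)  # placeholder 64-frame clip
model_input = uniform_subsample(clip)                # shape (16, 224, 224, 3)
```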
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
We trained and evaluated on only one fold of the UCF101 annotations. Unlike the VideoMAE paper, we did not run inference over multiple crops/segments of the validation videos, so these results are likely slightly lower than what multi-view evaluation would yield.
- Eval Accuracy: 0.758209764957428
- Eval Accuracy Top 5: 0.8983050584793091
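For reference, top-5 accuracy of the kind reported above can be computed from model logits and ground-truth labels as in this generic sketch (not the original evaluation script):

```python
import torch

# Generic top-5 accuracy over a batch of logits and integer labels;
# a sketch of the metric reported above, not the original eval code.
def top5_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    top5 = logits.topk(5, dim=-1).indices             # (batch, 5)
    correct = (top5 == labels.unsqueeze(-1)).any(-1)  # (batch,)
    return correct.float().mean().item()
```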
Bias, Risks, and Limitations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Model Card Authors
@nateraw
Model Card Contact
@nateraw
🔧 Technical Details
No technical details are provided in the original README.
📄 License
This model is licensed under the MIT license.