# Spatiotemporal Modeling
Vivit B 16x2 Kinetics400
MIT
ViViT is an extension of the Vision Transformer (ViT) for video processing, particularly suitable for video classification tasks.
Video Processing
Transformers

google
56.94k
32
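A minimal sketch of ViViT's tubelet tokenization. It assumes the common reading of "16x2" as non-overlapping tubelets that span 16x16 pixels and 2 frames, and it assumes a 32-frame 224x224 input clip. The function names are hypothetical, pure Python is used for clarity, and the learned linear projection that turns each flattened tubelet into an embedding is omitted.

```python
from typing import List, Tuple

# Single-channel video for simplicity: [frames][height][width]
Video = List[List[List[float]]]


def tubelet_grid(num_frames: int, height: int, width: int,
                 t: int = 2, p: int = 16) -> Tuple[int, int, int]:
    """Return (n_t, n_h, n_w): how many t x p x p tubelets tile the clip."""
    if num_frames % t or height % p or width % p:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return num_frames // t, height // p, width // p


def extract_tubelets(video: Video, t: int, p: int) -> List[List[float]]:
    """Flatten each non-overlapping t x p x p tubelet into one vector (one token).

    Tokens are ordered time-major: all spatial positions of the first
    temporal slice, then the next slice, and so on.
    """
    n_t, n_h, n_w = tubelet_grid(len(video), len(video[0]), len(video[0][0]), t, p)
    tokens = []
    for it in range(n_t):
        for ih in range(n_h):
            for iw in range(n_w):
                # Gather the t frames x p rows x p columns belonging to this tubelet
                vec = [video[it * t + dt][ih * p + dh][iw * p + dw]
                       for dt in range(t) for dh in range(p) for dw in range(p)]
                tokens.append(vec)
    return tokens
```

Under those assumptions, `tubelet_grid(32, 224, 224)` returns `(16, 14, 14)`, i.e. 3,136 tokens per clip before any class token is added.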
Vivit B 16x2
MIT
ViViT is an extension of the Vision Transformer (ViT) for video processing, primarily used for downstream tasks such as video classification.
Video Processing
Transformers

google
989
11
Videomae Large
VideoMAE is a self-supervised video pre-training model based on the Masked Autoencoder (MAE); it learns video representations by reconstructing the pixel values of masked video patches.
Video Processing
Transformers

MCG-NJU
3,243
31
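A minimal sketch of the masked-reconstruction objective. It assumes "tube" masking (the same spatial positions hidden in every temporal slice) with a 90% masking ratio, in line with the very high ratios the VideoMAE paper reports. Token order is time-major, the names are hypothetical, and the encoder and decoder that would produce `pred` are left out; only the masking and a loss restricted to masked tokens are shown.

```python
import random
from typing import List


def tube_mask(n_t: int, n_h: int, n_w: int, ratio: float = 0.9,
              seed: int = 0) -> List[bool]:
    """Boolean mask over n_t * n_h * n_w tokens (True = masked).

    The same spatial positions are masked in every temporal slice,
    so each masked position forms a "tube" through time.
    """
    n_spatial = n_h * n_w
    n_masked = int(round(ratio * n_spatial))
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(n_spatial), n_masked))
    # Time-major order: repeat the spatial pattern for each temporal slice
    return [s in masked_positions for _ in range(n_t) for s in range(n_spatial)]


def masked_mse(pred: List[List[float]], target: List[List[float]],
               mask: List[bool]) -> float:
    """Mean squared error over the pixel values of masked tokens only."""
    total, count = 0.0, 0
    for p_vec, t_vec, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p_vec, t_vec))
            count += len(t_vec)
    return total / count if count else 0.0
```

Visible tokens contribute nothing to the loss, so the model is pushed to infer the hidden content from the small visible fraction of the clip.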
Video Classification Cnn Rnn
A video classification model for action recognition, built on a hybrid CNN-RNN architecture.
Video Processing
keras-io
57
14
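A toy sketch of the CNN-RNN pattern, under loud assumptions: `frame_features` is a hypothetical stand-in for a per-frame CNN (it only returns the mean and maximum pixel value), the recurrent part is a plain Elman RNN rather than whatever recurrent layer the keras-io model actually uses, and all weights are supplied by hand. It illustrates only the data flow: per-frame features, recurrence over time, then a linear classifier on the final hidden state.

```python
import math
from typing import List, Sequence

Frame = List[List[float]]  # one grayscale frame, [height][width]


def frame_features(frame: Frame) -> List[float]:
    """Hypothetical stand-in for a per-frame CNN: returns [mean, max] of pixels."""
    pixels = [v for row in frame for v in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def rnn_encode(features: Sequence[Sequence[float]], w_x: List[List[float]],
               w_h: List[List[float]]) -> List[float]:
    """Elman RNN: h_t = tanh(W_x x_t + W_h h_{t-1}); returns the last hidden state."""
    hidden = [0.0] * len(w_h)
    for x in features:
        hidden = [math.tanh(sum(wx * xi for wx, xi in zip(w_x[i], x)) +
                            sum(wh * hj for wh, hj in zip(w_h[i], hidden)))
                  for i in range(len(w_h))]
    return hidden


def classify(video: List[Frame], w_x: List[List[float]], w_h: List[List[float]],
             w_out: List[List[float]]) -> int:
    """Per-frame features -> RNN over time -> linear layer -> argmax class index."""
    feats = [frame_features(f) for f in video]
    h = rnn_encode(feats, w_x, w_h)
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
    return max(range(len(logits)), key=logits.__getitem__)
```

Splitting the work this way lets a pretrained image CNN handle spatial features while the recurrent layer models temporal order.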