VideoMAE-v2 (base-sized model, pretrained on UnlabeledHybrid-1M)
The VideoMAEv2-Base model is pre-trained in a self-supervised manner for 800 epochs on the UnlabeledHybrid-1M dataset. The pre-trained backbone can be used directly for video feature extraction or fine-tuned for video classification.
Quick Start
VideoMAE V2 was introduced in the CVPR 2023 paper [VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking](https://arxiv.org/abs/2303.16727) by Wang et al. and first released in [this repository](https://github.com/OpenGVLab/VideoMAEv2).
Features
- The model can be used for video feature extraction.
Installation
No model-specific installation steps are required; the usage example below only depends on the `transformers`, `torch`, and `numpy` packages. A typical setup with pip (the original card pins no versions, so treat this as a reasonable default):
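```bash
pip install transformers torch numpy
```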
Usage Examples
Basic Usage
Here is how to use this model to extract video features:
```python
from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

# Load the remote-code model, its config, and the matching image processor
config = AutoConfig.from_pretrained("OpenGVLab/VideoMAEv2-Base", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("OpenGVLab/VideoMAEv2-Base")
model = AutoModel.from_pretrained("OpenGVLab/VideoMAEv2-Base", config=config, trust_remote_code=True)

# 16 random frames of shape (channels, height, width) stand in for a real clip
video = list(np.random.rand(16, 3, 224, 224))
inputs = processor(video, return_tensors="pt")

# The processor returns (batch, frames, channels, height, width);
# the model expects (batch, channels, frames, height, width)
inputs["pixel_values"] = inputs["pixel_values"].permute(0, 2, 1, 3, 4)

with torch.no_grad():
    outputs = model(**inputs)
```
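The random array above is only a stand-in. To run the same pipeline on a real clip, you can uniformly sample 16 RGB frames from a video file first. A minimal sketch using OpenCV (an assumption on our part; any frame reader such as decord works equally well, and `example.mp4` is a hypothetical path):

```python
import cv2  # assumption: installed via `pip install opencv-python`
import numpy as np

def sample_frames(path, num_frames=16):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; the processor expects RGB
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# The processor handles resizing and normalization; the permute and
# forward pass from the example above then apply unchanged.
video = sample_frames("example.mp4")
inputs = processor(video, return_tensors="pt")
```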
Documentation
Intended uses & limitations
You can use the raw model for video feature extraction.
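What `outputs` contains depends on the remote-code implementation, so the attribute names below are assumptions rather than a documented API. A hedged sketch for reducing the output to one feature vector per clip:

```python
# Assumption: the remote-code forward may return a raw tensor, a tuple,
# or an output object exposing `last_hidden_state` -- handle all three.
features = outputs[0] if isinstance(outputs, (tuple, list)) else outputs
features = getattr(features, "last_hidden_state", features)

# If per-token embeddings (batch, tokens, dim) remain, mean-pool over
# the token dimension to get a single vector per clip.
if features.dim() == 3:
    features = features.mean(dim=1)

print(features.shape)  # expected (1, 768) for the base model
```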
BibTeX entry and citation info
```bibtex
@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
    title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
    author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
    year          = {2023},
    eprint        = {2303.16727},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}
```
License
This project is licensed under the CC BY-NC 4.0 license.