VideoMAEv2-Huge Open-Source Video Feature Extraction Model - Efficient and Accurate Extraction of Key Video Features

VideoMAEv2-Huge

Developed by OpenGVLab

VideoMAEv2-Huge is a self-supervised learning-based video feature extraction model, pre-trained for 1200 epochs on the UnlabeledHybrid-1M dataset.

Video Processing

Safetensors

#Video Self-Supervised Learning #Large-Scale Pre-Training #Dual-Masking Strategy

Downloads 1,145

Release Time: 1/14/2025

Model Overview

This model is primarily used for video feature extraction. It is pre-trained with a dual-masking strategy so that it captures spatiotemporal features in videos.

Model Features

Dual-Masking Pre-Training Strategy

Employs a dual-masking strategy for self-supervised learning, enhancing the model's understanding of spatiotemporal features in videos.

Large-Scale Pre-Training

Pre-trained for 1200 epochs on the UnlabeledHybrid-1M dataset, learning rich video feature representations.

Efficient Feature Extraction

Capable of extracting meaningful spatiotemporal features from videos, suitable for downstream video understanding tasks.

Model Capabilities

Video Feature Extraction

Video Classification

Video Understanding

Use Cases

Video Analysis

Video Content Classification

Classify video content, for example for action recognition or scene recognition.

Video Retrieval

Extract video features for similar video retrieval.
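
The retrieval use case can be sketched as a nearest-neighbour search over feature vectors. This is a minimal illustration, not the model's own API: it assumes each video has already been reduced to a single fixed-length feature vector, and the helper names (l2_normalize, top_k_similar), the 8-dimensional toy features, and the use of cosine similarity are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_similar(query, gallery, k=3):
    """Return indices and cosine scores of the k gallery vectors closest to query.

    query:   array of shape (D,), one feature vector (hypothetical).
    gallery: array of shape (N, D), one feature vector per indexed video.
    """
    q = l2_normalize(np.asarray(query, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery, dtype=np.float64))
    scores = g @ q                      # cosine similarity, shape (N,)
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]


# Toy data standing in for real video features (assumption: 8-dim vectors).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 8))
query = gallery[2] + 0.01 * rng.normal(size=8)  # near-duplicate of item 2

indices, scores = top_k_similar(query, gallery, k=3)
print(indices, scores)
```

In practice the gallery rows would come from running the model on each indexed video and pooling its output into one vector; how to pool depends on the output format of the model, which the source does not specify.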

🚀 VideoMAE-v2 (Huge-sized model, Pretrained on UnlabeledHybrid-1M)

The VideoMAEv2-Huge model was pre-trained for 1200 epochs in a self-supervised way on the UnlabeledHybrid-1M dataset. It can be used for video feature extraction.

🚀 Quick Start

The VideoMAEv2-Huge model is pre-trained for 1200 epochs in a self-supervised way on the UnlabeledHybrid-1M dataset. It was introduced in the CVPR 2023 paper "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking" by Wang et al. and first released on GitHub.

✨ Features

You can use the raw model for video feature extraction.

📦 Installation

The original README provides no specific installation steps. The usage example below imports transformers, numpy, and torch, so these packages need to be available in your environment.

💻 Usage Examples

Basic Usage

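from transformers import VideoMAEImageProcessor, AutoModel, AutoConfig
import numpy as np
import torch

# Load the config, preprocessor, and model from the Hub.
# trust_remote_code=True allows loading the custom model code shipped in the repository.
config = AutoConfig.from_pretrained("OpenGVLab/VideoMAEv2-Huge", trust_remote_code=True)
processor = VideoMAEImageProcessor.from_pretrained("OpenGVLab/VideoMAEv2-Huge")
model = AutoModel.from_pretrained("OpenGVLab/VideoMAEv2-Huge", config=config, trust_remote_code=True)

# Dummy clip: a list of 16 random frames, each of shape (C, H, W) = (3, 224, 224)
video = list(np.random.rand(16, 3, 224, 224))

# Preprocess the frames; pixel_values comes back as (B, T, C, H, W)
inputs = processor(video, return_tensors="pt")
# B, T, C, H, W -> B, C, T, H, W
inputs["pixel_values"] = inputs["pixel_values"].permute(0, 2, 1, 3, 4)

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)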
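
The example above feeds the model a random 16-frame clip. For a real video, one common approach is to pick evenly spaced frames and then apply the same axis reordering. The sketch below shows this with NumPy only, so it runs without downloading the model. It is an illustration under assumptions: the 16-frame clip length is taken from the example above, sample_frame_indices is a hypothetical helper, uniform sampling is one possible strategy rather than the authors' documented one, and np.transpose stands in for the tensor permute call.

```python
import numpy as np

NUM_FRAMES = 16  # clip length used in the usage example above


def sample_frame_indices(total_frames, num_frames=NUM_FRAMES):
    """Return num_frames evenly spaced frame indices in [0, total_frames - 1]."""
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int)


# Hypothetical decoded video: 48 frames in (T, C, H, W) layout.
video = np.zeros((48, 3, 224, 224), dtype=np.uint8)

idx = sample_frame_indices(len(video))
clip = video[idx]                                   # (16, 3, 224, 224) = (T, C, H, W)
batch = clip[None]                                  # (1, 16, 3, 224, 224) = (B, T, C, H, W)
batch_cthw = np.transpose(batch, (0, 2, 1, 3, 4))   # (B, C, T, H, W)

print(idx)
print(batch.shape, "->", batch_cthw.shape)
```

To use the sampled frames with the model, list(clip) could be passed to the processor in place of the random video in the usage example.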

📚 Documentation

BibTeX entry and citation info

@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
      title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
      author={Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
      year={2023},
      eprint={2303.16727},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

This project is licensed under the CC BY-NC 4.0 license.
