Timesformer-hr-finetuned-k600 Open-Source Video Understanding Model - High-Resolution Fine-Tuning Empowers Video Analysis

Timesformer Hr Finetuned K600

Published by fcakyon

TimeSformer is a video understanding model built on spatiotemporal self-attention. This checkpoint is its high-resolution variant, fine-tuned on the Kinetics-600 dataset.

Video Processing

Transformers

#Video Action Recognition #Spatiotemporal Attention Mechanism #High-Resolution Processing


Release time: 12/10/2022

Model Overview

This model is primarily used for video classification: it assigns a clip to one of the 600 action categories of the Kinetics-600 dataset. It models the spatiotemporal information in video using self-attention alone, without convolutional layers.

Model Features

Pure Attention Mechanism

Processes video entirely with a Transformer architecture, using self-attention in place of convolutional layers.
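Without convolutions, each frame must first be turned into a sequence of tokens. The NumPy sketch below shows only that splitting step. It assumes the ViT-style scheme of non-overlapping 16×16 patches. The function name `patchify` is hypothetical, and the learned linear projection and position embeddings that follow in the real model are left out.

```python
import numpy as np

def patchify(video: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split a (T, C, H, W) clip into flattened patches of shape (T, N, C*P*P).

    N = (H // P) * (W // P) is the number of patches per frame. In the model, a
    learned linear layer would then map each flattened patch to the hidden size.
    """
    t, c, h, w = video.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "frame size must be divisible by the patch size"
    # (T, C, H/P, P, W/P, P) -> (T, H/P, W/P, C, P, P)
    x = video.reshape(t, c, h // p, p, w // p, p).transpose(0, 2, 4, 1, 3, 5)
    # Flatten the patch grid into N tokens and each patch into one vector
    return x.reshape(t, (h // p) * (w // p), c * p * p)

clip = np.random.randn(16, 3, 448, 448)  # same dummy shape as the usage example
tokens = patchify(clip)
print(tokens.shape)  # (16, 784, 768)
```

Patches are numbered row by row, so token 1 of a frame is the patch just to the right of token 0.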

High-Resolution Support

The high-resolution (HR) variant takes 16 frames at 448×448 pixels, compared with 8 frames at 224×224 for the base TimeSformer, so it can resolve finer spatial detail.
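The arithmetic below gives a sense of what the higher resolution costs. It counts tokens for the base (8 frames, 224×224) and HR (16 frames, 448×448) inputs from the TimeSformer paper, assuming 16×16 patches and one extra classification token. It also compares how many tokens a single query patch is compared against under joint space-time attention, (NF+1), and under divided attention, (N+F+2). These formulas follow the paper, with N patches per frame and F frames. The helper `token_stats` is hypothetical.

```python
def token_stats(num_frames: int, image_size: int, patch_size: int = 16) -> dict:
    """Token counts for a clip of square frames split into patch_size x patch_size patches."""
    patches_per_frame = (image_size // patch_size) ** 2
    return {
        "patches_per_frame": patches_per_frame,
        # all patch tokens plus one classification ([CLS]) token
        "total_tokens": num_frames * patches_per_frame + 1,
        # joint space-time attention: each query patch is compared with every token
        "joint_comparisons": num_frames * patches_per_frame + 1,
        # divided attention: one location across frames, plus all patches of one frame
        "divided_comparisons": num_frames + patches_per_frame + 2,
    }

base = token_stats(num_frames=8, image_size=224)  # base input: 8 x 224 x 224
hr = token_stats(num_frames=16, image_size=448)   # HR input: 16 x 448 x 448
print(base)  # 196 patches/frame, 1569 tokens, 206 divided comparisons
print(hr)    # 784 patches/frame, 12545 tokens, 802 divided comparisons
```

The HR input has 8× as many tokens as the base input. Under joint attention, each query would be compared with all 12545 of them. Under divided attention, the count per query stays in the hundreds.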

Spatiotemporal Modeling

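Models both spatial and temporal structure in video. TimeSformer's default "divided" space-time attention works in two separate steps within each block. Temporal attention first relates each patch to the same location in the other frames. Spatial attention then relates it to the other patches of its own frame.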
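The NumPy sketch below illustrates one divided space-time attention block in heavily simplified form. It uses a single head and takes the token embeddings directly as queries, keys and values. There are no learned projections, LayerNorm, MLP or [CLS] token. All function names are hypothetical. The sketch only shows how the two attention steps group the tokens differently.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention within groups. x: (groups, tokens, dim)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (groups, tokens, tokens)
    return softmax(scores, axis=-1) @ x

def temporal_attention(x: np.ndarray) -> np.ndarray:
    """x: (T, S, D). Each patch attends to the same location in every frame."""
    return attend(x.transpose(1, 0, 2)).transpose(1, 0, 2)  # group by patch location

def spatial_attention(x: np.ndarray) -> np.ndarray:
    """x: (T, S, D). Each patch attends to every patch of its own frame."""
    return attend(x)  # groups are already frames

def divided_block(x: np.ndarray) -> np.ndarray:
    # Temporal attention first, then spatial attention, each with a residual add
    x = x + temporal_attention(x)
    return x + spatial_attention(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 196, 32))  # toy clip: 8 frames x 196 patches x dim 32
print(divided_block(x).shape)  # (8, 196, 32)
```

Each attention step only compares tokens within one group: a patch location across frames, or the patches of a single frame. This is why divided attention is much cheaper than attending over all frames and patches at once.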

Model Capabilities

Video Content Classification

Spatiotemporal Feature Extraction

Action Recognition

Use Cases

Video Analysis

Action Recognition

Identifies human actions and behaviors in videos.

Can recognize 600 action categories in the Kinetics-600 dataset.

Video Content Classification

Automatically classifies and tags video content.

🚀 TimeSformer (high-resolution variant, fine-tuned on Kinetics-600)

A TimeSformer model fine-tuned on Kinetics-600 for video classification.

🚀 Quick Start

This TimeSformer model is fine-tuned on Kinetics-600. TimeSformer was introduced in the paper "Is Space-Time Attention All You Need for Video Understanding?" by Bertasius et al. and first released in the authors' official repository.

Disclaimer: The team that released TimeSformer did not write a model card for this model, so this model card was written by fcakyon.

✨ Features

You can use the raw model for video classification into one of the 600 possible Kinetics-600 labels.
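The classifier outputs one logit per Kinetics-600 class. The hypothetical helper below returns the k most likely labels together with their softmax probabilities. It assumes the logits are available as a NumPy array or list and that `id2label` maps integer indices to label strings, as `model.config.id2label` does in the usage example.

```python
import numpy as np

def top_k_labels(logits, id2label, k=5):
    """Return [(label, probability), ...] for the k highest-scoring classes.

    logits: 1-D sequence with one score per class (600 for Kinetics-600).
    id2label: dict mapping class index -> label name.
    """
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax over all classes
    top = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(id2label[int(i)], float(probs[i])) for i in top]

# Toy example with 4 made-up classes (not real Kinetics-600 labels)
toy_labels = {0: "class_a", 1: "class_b", 2: "class_c", 3: "class_d"}
print(top_k_labels([0.1, 2.0, -1.0, 1.5], toy_labels, k=2))  # class_b first, then class_d
```

With the outputs of the usage example, this could be called as top_k_labels(logits[0].numpy(), model.config.id2label).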

📦 Installation

The usage example requires PyTorch, NumPy and Hugging Face Transformers, which can be installed with pip install torch numpy transformers.

💻 Usage Examples

Basic Usage

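from transformers import AutoImageProcessor, TimesformerForVideoClassification
import numpy as np
import torch

# Dummy clip: 16 channels-first frames of shape (3, 448, 448), matching the HR input size
video = list(np.random.randn(16, 3, 448, 448))

processor = AutoImageProcessor.from_pretrained("fcakyon/timesformer-hr-finetuned-k600")
model = TimesformerForVideoClassification.from_pretrained("fcakyon/timesformer-hr-finetuned-k600")

# Preprocess (resize, crop, normalize) and batch the frames into a pixel_values
# tensor shaped (batch, frames, channels, height, width)
inputs = processor(images=video, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape (1, 600): one score per Kinetics-600 class

# Map the highest-scoring index to its label name
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])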
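The usage example feeds 16 random frames. A real video usually has many more frames than that. One common approach, assumed here rather than specified by this model card, is to sample 16 evenly spaced frame indices. The hypothetical helper below does this with NumPy. Decoding the frames themselves is left to whatever video-reading library you use.

```python
import numpy as np

def sample_frame_indices(num_video_frames: int, num_samples: int = 16) -> np.ndarray:
    """Pick num_samples evenly spaced frame indices from a clip of num_video_frames."""
    if num_video_frames <= 0:
        raise ValueError("video must contain at least one frame")
    # linspace spans the whole clip; rounding may repeat frames in very short clips
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int)

print(sample_frame_indices(300))     # 16 indices from 0 to 299
print(sample_frame_indices(10, 16))  # short clip: some indices repeat
```

The selected frames, each converted to a (3, H, W) array, can then replace the random frames in the list passed to the processor.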

Advanced Usage

For more code examples, we refer to the documentation.

📄 License

The model is licensed under CC BY-NC 4.0.

BibTeX entry and citation info

@inproceedings{bertasius2021space,
  title={Is Space-Time Attention All You Need for Video Understanding?},
  author={Bertasius, Gedas and Wang, Heng and Torresani, Lorenzo},
  booktitle={International Conference on Machine Learning},
  pages={813--824},
  year={2021},
  organization={PMLR}
}