xclip-large-patch14 Open-source Model - Free Support for General Video and Language Understanding Applications

Xclip Large Patch14

Developed by Microsoft

X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

Text-to-Video

Transformers

English

Open Source License: MIT

#Video-Text Contrastive Learning #Zero-shot Video Classification #High-precision Action Recognition

Downloads 1,698

Release Time: 9/7/2022

Model Overview

The X-CLIP model (large size, 14×14 patch resolution) was trained in a fully supervised fashion on the Kinetics-400 dataset. It can be used for zero-shot, few-shot, or fully supervised video classification, as well as for video-text retrieval.

Model Features

Video-Language Understanding

Trained via contrastive learning on (video, text) pairs, supporting video and text matching.

High Accuracy

Achieves Top-1 accuracy of 87.1% and Top-5 accuracy of 97.6% on the Kinetics-400 dataset.

Multi-task Support

Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.

Model Capabilities

Video Classification

Video-Text Retrieval

Zero-shot Learning

Few-shot Learning

Use Cases

Video Analysis

Video Classification

Classify video content, such as recognizing actions, scenes, etc.

Reported results on Kinetics-400: Top-1 accuracy 87.1%, Top-5 accuracy 97.6%.

Video-Text Retrieval

Retrieve relevant video clips based on text descriptions.

🚀 X-CLIP (Large-sized Model)

X-CLIP is a model for general video-language understanding. This checkpoint was trained in a fully supervised fashion on Kinetics-400 and supports tasks such as video classification and video-text retrieval.

🚀 Quick Start

This X-CLIP checkpoint (large-sized, patch resolution of 14) was trained in a fully supervised fashion on Kinetics-400. X-CLIP was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in this repository. The model was trained using 8 frames per video, at a resolution of 224x224.
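Because the model takes 8 frames per video, a longer clip has to be subsampled before it is passed in. The sketch below shows one common way to do that, uniform sampling of frame indices. The helper name `sample_frame_indices` and the choice of segment midpoints are assumptions made for illustration; the source does not state which sampling strategy the X-CLIP authors used.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick `num_frames` indices spread evenly over a clip of `total_frames`."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Split the clip into equal segments and take the midpoint of each one.
    segment = total_frames / num_frames
    indices = [int(segment * (i + 0.5)) for i in range(num_frames)]
    # Clamp to the last valid index as a safeguard against rounding.
    return [min(idx, total_frames - 1) for idx in indices]


# An 80-frame clip is reduced to 8 evenly spaced frame indices.
indices = sample_frame_indices(80)
```

For clips shorter than 8 frames, this sketch repeats indices rather than failing, which may or may not be the behavior you want.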

Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

X-CLIP is a minimal extension of CLIP for general video-language understanding.
The model is trained contrastively on (video, text) pairs, so it can be used for zero-shot, few-shot, or fully supervised video classification and for video-text retrieval.
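To make the contrastive setup concrete, the sketch below shows how CLIP-style models typically turn embeddings into label probabilities at inference time: cosine similarity between one video embedding and several text embeddings, scaled and passed through a softmax. The toy 3-dimensional vectors, the `logit_scale` value, and all function names are assumptions for illustration; real X-CLIP embeddings come from the model's encoders and are much larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(video_emb, text_embs, logit_scale=100.0):
    """Probability of each text given one video (logit_scale is illustrative)."""
    sims = [logit_scale * cosine_similarity(video_emb, t) for t in text_embs]
    return softmax(sims)


# Toy embeddings, made up for this example only.
video = [0.9, 0.1, 0.0]
labels = {"playing guitar": [1.0, 0.0, 0.0], "cooking": [0.0, 1.0, 0.0]}
probs = zero_shot_probs(video, list(labels.values()))
best = max(zip(labels, probs), key=lambda pair: pair[1])[0]
```

The same similarity scores can be used for retrieval by ranking many videos against one text instead of many texts against one video.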

X-CLIP architecture

📚 Documentation

Intended Uses & Limitations

You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you.

How to Use

For code examples, refer to the documentation.
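As a starting point, here is a minimal sketch of scoring candidate texts against one clip with the Hugging Face `transformers` library, following the pattern used in its X-CLIP documentation. The checkpoint id `microsoft/xclip-large-patch14` is assumed from the model name and developer shown on this page, the helper names `check_clip` and `classify_clip` are hypothetical, and the exact processor arguments may differ between library versions.

```python
def check_clip(frames, expected_frames=8):
    """Raise if the clip lacks the 8 frames this checkpoint was trained with."""
    if len(frames) != expected_frames:
        raise ValueError(f"expected {expected_frames} frames, got {len(frames)}")
    return frames


def classify_clip(frames, candidate_texts,
                  checkpoint="microsoft/xclip-large-patch14"):  # assumed hub id
    """Return {text: probability} for one clip (a list of HxWx3 uint8 arrays)."""
    # Heavy dependencies are imported lazily; install torch and transformers first.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    check_clip(frames)
    processor = XCLIPProcessor.from_pretrained(checkpoint)
    model = XCLIPModel.from_pretrained(checkpoint)
    inputs = processor(text=candidate_texts, videos=list(frames),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_video has shape (num_videos, num_texts); softmax over the texts.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return dict(zip(candidate_texts, probs.tolist()))
```

Calling `classify_clip(frames, ["playing sports", "eating spaghetti"])` downloads the checkpoint on first use, so expect a large download for the large-sized model.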

Training Data

This model was trained on Kinetics-400.

Preprocessing

The exact details of preprocessing during training can be found here.
The exact details of preprocessing during validation can be found here.

During validation, the shorter edge of each frame is resized, and the frame is then center-cropped to a fixed resolution (such as 224x224). Finally, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
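The validation steps described above can be sketched with a few small helpers. This is an illustration under stated assumptions, not the authors' pipeline: the resize target of 224 for the shorter edge is assumed (the source does not give it), the function names are hypothetical, and only the geometry and per-pixel arithmetic are shown rather than actual image resampling. The mean and standard deviation are the standard ImageNet values.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_shape(height, width, shorter_edge=224):
    """Scale (height, width) so the shorter side equals `shorter_edge`."""
    scale = shorter_edge / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height, width, size=224):
    """Return (top, left, bottom, right) of a centered size x size crop."""
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size


def normalize_pixel(rgb):
    """Map 0-255 RGB values to ImageNet-normalized floats."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# A 720x1280 frame: resize the shorter edge, then locate the center crop.
h, w = resized_shape(720, 1280)
box = center_crop_box(h, w)
```

In practice an image processor from a library would perform these steps on whole frames; the helpers only make the order of operations explicit.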

Evaluation Results

On Kinetics-400, this model achieves a top-1 accuracy of 87.1% and a top-5 accuracy of 97.6%.
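For readers unfamiliar with the metric, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below computes it on made-up scores; the function name and the toy data are assumptions for illustration and are unrelated to the reported Kinetics-400 numbers.

```python
def top_k_accuracy(scores, targets, k=1):
    """Fraction of samples whose true class is among the k highest scores."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Toy scores for 4 samples over 3 classes (invented for this example).
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.6, 0.1],
]
targets = [0, 1, 0, 1]
top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
```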

📄 License

This project is licensed under the MIT license.

📦 Model Information

Property	Details
Model Type	X-CLIP (large-sized, patch resolution of 14)
Training Data	Kinetics-400
Top-1 Accuracy	87.1%
Top-5 Accuracy	97.6%
