xclip-large-patch14-kinetics-600

Xclip Large Patch14 Kinetics 600

Developed by Microsoft

X-CLIP is an extended version of CLIP for general video-language understanding, trained on video-text pairs through contrastive learning.

Text-to-Video

Transformers

Language: English
Open Source License: MIT
Tags: Video-text contrastive learning, Zero-shot video classification, High-precision action recognition

Downloads 124

Release Time: 9/8/2022

Model Overview

The X-CLIP model (large size, patch resolution of 14) was trained in a fully supervised fashion on Kinetics-600 and is suitable for tasks such as video classification and video-text retrieval.

Model Features

Video-Language Understanding

Trained on video-text pairs through contrastive learning, supporting video classification and video-text retrieval.

High Accuracy

Achieves a top-1 accuracy of 88.3% and a top-5 accuracy of 97.7% on Kinetics-600, the dataset this checkpoint was trained on.

Multi-task Support

Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.

Model Capabilities

Video classification

Video-text retrieval

Zero-shot learning

Few-shot learning

Use Cases

Video Analysis

Video Classification

Classify video content to recognize actions or scenes in videos.

Achieves 88.3% top-1 accuracy on the Kinetics-600 dataset.

Video-Text Retrieval

Retrieve relevant videos based on text descriptions, or rank candidate text descriptions by how well they match a given video.

🚀 X-CLIP (large-sized model)

X-CLIP is a model designed for general video-language understanding. It extends CLIP and is trained on video-text pairs, enabling it to handle various video-related tasks such as classification and retrieval.

🚀 Quick Start

The X-CLIP model (large-sized, patch resolution of 14) is trained fully supervised on Kinetics-600. It was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP).

This model was trained using 8 frames per video, at a resolution of 224x224.
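To make the 8-frame input concrete, here is a minimal sketch of picking 8 frame indices from a longer clip. The card does not state which sampling strategy was used, so the uniform, segment-centred scheme below is an assumption, and `sample_frame_indices` is a hypothetical helper name.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick `num_frames` indices spread evenly over a clip.

    Assumption: the clip is split into equal segments and the middle frame
    of each segment is taken. The model card does not specify the strategy.
    """
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Clamp to the last valid index; short clips simply repeat frames.
    return [
        min(total_frames - 1, int(segment * i + segment / 2))
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    print(sample_frame_indices(80))
```

Clips shorter than 8 frames are handled by repeating indices, which keeps the number of frames passed to the model fixed.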

Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

General Video-Language Understanding: X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.
Versatile Task Handling: This allows the model to be used for tasks like zero-shot, few-shot, or fully supervised video classification and video-text retrieval.

📚 Documentation

Intended uses & limitations

You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you.
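As a rough illustration of what "how well text goes with a given video" means for a contrastively trained model, the sketch below scores one video embedding against several text embeddings using cosine similarity followed by a softmax. The embeddings are placeholders rather than real model outputs, `match_probabilities` is a hypothetical helper, and the `logit_scale` of 100 is an assumed CLIP-style temperature, not a value taken from this card.

```python
import numpy as np


def match_probabilities(video_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per text for a single video.

    Assumption: similarity is the cosine between L2-normalized embeddings,
    scaled by `logit_scale` and turned into probabilities with a softmax.
    """
    v = np.asarray(video_embedding, dtype=float)
    t = np.asarray(text_embeddings, dtype=float)
    v = v / np.linalg.norm(v)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = logit_scale * (t @ v)
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


if __name__ == "__main__":
    video = [1.0, 0.0]
    texts = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
    print(match_probabilities(video, texts))
```

The text whose embedding points in the same direction as the video embedding receives nearly all of the probability mass, which is the behaviour zero-shot classification relies on.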

How to use

For code examples, we refer to the documentation.
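For orientation, a minimal sketch of zero-shot style scoring with this checkpoint might look like the following. It assumes the `XCLIPProcessor` and `XCLIPModel` classes from the `transformers` library, PyTorch as the backend, and the checkpoint id `microsoft/xclip-large-patch14-kinetics-600` (inferred from the model name and developer given on this page). `rank_labels` and `classify_video` are hypothetical helpers, and the caller is expected to supply 8 decoded RGB frames.

```python
from typing import List, Sequence, Tuple

# Assumed Hub id, inferred from the model name and developer on this page.
MODEL_ID = "microsoft/xclip-large-patch14-kinetics-600"


def rank_labels(probs: Sequence[float], labels: Sequence[str]) -> List[Tuple[str, float]]:
    """Pair each label with its probability, highest probability first."""
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


def classify_video(frames, labels: Sequence[str], model_id: str = MODEL_ID):
    """Score candidate text labels against one video.

    `frames` is a list of 8 RGB frames (H x W x 3 uint8 arrays).
    Heavy dependencies are imported lazily so the helper above stays usable
    without them. Calling this downloads the checkpoint on first use.
    """
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)
    inputs = processor(
        text=list(labels), videos=list(frames), return_tensors="pt", padding=True
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_video has shape (num_videos, num_texts); one video here.
    probs = outputs.logits_per_video.softmax(dim=1)[0].tolist()
    return rank_labels(probs, labels)
```

A call such as `classify_video(frames, ["playing sports", "eating spaghetti", "go shopping"])` would return the labels ordered by predicted probability. The exact processor arguments can differ between library versions, so treat this as a starting point rather than a reference implementation.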

Training data

This model was trained on Kinetics-600.

Preprocessing

The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247).

The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285).

During validation, the shorter edge of each frame is resized, and the frame is then center-cropped to a fixed resolution (such as 224x224). Finally, frames are normalized per RGB channel using the ImageNet mean and standard deviation.
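The validation steps described above can be sketched for a single frame as follows. This is an illustration under stated assumptions, not the repository's code: it resizes the shorter edge directly to the crop size, uses nearest-neighbour resampling to stay NumPy-only, and uses the commonly cited ImageNet statistics. All function names are hypothetical.

```python
import numpy as np

# Commonly used ImageNet channel statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def resize_shorter_edge(frame: np.ndarray, size: int) -> np.ndarray:
    """Resize so the shorter edge equals `size` (nearest-neighbour, assumed)."""
    h, w = frame.shape[:2]
    scale = size / min(h, w)
    new_h = max(size, round(h * scale))
    new_w = max(size, round(w * scale))
    # Map every output row/column back to its nearest source index.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return frame[rows][:, cols]


def center_crop(frame: np.ndarray, size: int) -> np.ndarray:
    """Cut a `size` x `size` window from the middle of the frame."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]


def preprocess_frame(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize, center-crop, scale to [0, 1], then normalize each RGB channel."""
    frame = center_crop(resize_shorter_edge(frame, size), size)
    scaled = frame.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


if __name__ == "__main__":
    dummy = np.zeros((240, 320, 3), dtype=np.uint8)
    print(preprocess_frame(dummy).shape)
```

In practice an image library with proper interpolation would replace the nearest-neighbour step, and the linked `build.py` remains the authoritative source for the exact sizes used.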

Evaluation results

Property	Details
Model Type	X-CLIP (large-sized model, patch resolution of 14)
Training Data	Kinetics-600
Top-1 Accuracy	88.3%
Top-5 Accuracy	97.7%
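For readers unfamiliar with the two metrics in the table, top-k accuracy is the fraction of videos whose true class is among the k highest-scoring predictions. The sketch below computes it on toy scores. The numbers are made up for illustration and are unrelated to the reported results, and `top_k_accuracy` is a hypothetical helper.

```python
import numpy as np


def top_k_accuracy(scores, targets, k: int) -> float:
    """Fraction of rows whose target index is among the k largest scores."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets)
    # Indices of the k best-scoring classes for each sample.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    toy_scores = [
        [0.1, 0.7, 0.2],
        [0.5, 0.3, 0.2],
        [0.2, 0.3, 0.5],
    ]
    toy_targets = [1, 1, 0]
    print(top_k_accuracy(toy_scores, toy_targets, k=1))
    print(top_k_accuracy(toy_scores, toy_targets, k=2))
```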

📄 License

This model is released under the MIT license.
