X-CLIP Base Patch16 (Kinetics-600)
X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Release date: 9/8/2022
Model Overview
This is the base-sized X-CLIP model with a 16x16 patch size, trained with full supervision on the Kinetics-600 dataset. It is suitable for video classification and video-text retrieval tasks.
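As a sketch of how this checkpoint might be used for video classification, the snippet below loads it through the Hugging Face `transformers` library and scores a clip against a few candidate labels. The checkpoint name `microsoft/xclip-base-patch16-kinetics-600` and the random dummy frames are assumptions for illustration; in practice you would sample 8 frames from a real video.

```python
# Minimal usage sketch. Assumes the `transformers` and `torch` packages are
# installed and that the checkpoint name below is correct for this model.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch16-kinetics-600"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# X-CLIP consumes a short clip of sampled frames; here we fake an
# 8-frame, 224x224 RGB clip with random pixels just to show the shapes.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

texts = ["playing basketball", "cooking pasta", "walking a dog"]
inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# One similarity score per (video, text) pair; softmax turns the scores
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_video.softmax(dim=1)
print(probs.shape)  # one row (the video) by three columns (the labels)
```

With real frames, the highest-probability label is the model's prediction for the clip.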
Model Features
Video-Language Understanding
Trained on (video, text) pairs via contrastive learning, enabling the model to judge whether a video and a text description match.
High Accuracy
Achieves 85.3% top-1 accuracy and 97.1% top-5 accuracy on the Kinetics-600 dataset.
Zero-Shot and Few-Shot Learning
Supports zero-shot, few-shot, or fully supervised video classification tasks.
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-Shot Learning
Few-Shot Learning
Use Cases
Video Analysis
Video Content Classification
Classify video content to identify actions or scenes.
Performs strongly on the Kinetics-600 benchmark.
Video-Text Matching
Determine whether a given text matches the video content.
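One way to make this match judgment is to compare the model's video and text embeddings directly. The sketch below uses the `get_video_features` and `get_text_features` methods from `transformers` and a cosine-similarity threshold; the checkpoint name, the dummy frames, and the 0.2 threshold are all illustrative assumptions, not values from the model card.

```python
# Video-text matching sketch via embedding similarity. Assumes the
# `transformers` and `torch` packages and the checkpoint name below.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch16-kinetics-600"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# Dummy 8-frame clip standing in for a real video.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
inputs = processor(text=["a person riding a bike"], videos=video,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    video_emb = model.get_video_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity lies in [-1, 1]; higher means a better match.
sim = torch.nn.functional.cosine_similarity(video_emb, text_emb).item()
# The 0.2 cutoff is a hypothetical threshold you would tune on your data.
is_match = sim > 0.2
```

Thresholding the similarity gives a yes/no match decision; ranking similarities across many texts instead gives video-text retrieval.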