X-CLIP Base Patch32 16 Frames
X-CLIP is an extension of CLIP for general video-language understanding. It is trained on (video, text) pairs via contrastive learning and suits tasks such as video classification and video-text retrieval.
Downloads 901
Release Time: 9/7/2022
Model Overview
The X-CLIP model (base size, 32×32 patch resolution, 16 frames per video) was trained in a fully supervised fashion on the Kinetics-400 dataset. It supports zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
Model Features
Video-Language Understanding
Trained on (video, text) pairs with a contrastive objective, so videos and text descriptions can be matched in a shared embedding space.
High Accuracy
Achieves 81.1% top-1 accuracy and 95.5% top-5 accuracy on the Kinetics-400 dataset.
Multi-task Support
Suitable for zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
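The contrastive matching behind these tasks can be sketched numerically: the video and each candidate text prompt are embedded into a shared space, and zero-shot classification reduces to a softmax over scaled cosine similarities. The embeddings, dimensions, and temperature below are made-up stand-ins for illustration, not real X-CLIP outputs.

```python
import numpy as np

def zero_shot_scores(video_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one video embedding
    and several text-prompt embeddings (CLIP-style matching)."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = t @ v / temperature          # scaled cosine similarities
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

# Toy example: 3 candidate class prompts, 4-dim embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
video = rng.normal(size=4)
texts = rng.normal(size=(3, 4))
probs = zero_shot_scores(video, texts)
```

The prompt with the highest probability is taken as the predicted class; swapping the roles of video and text embeddings gives text-to-video retrieval scores instead.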
Model Capabilities
Video Classification
Video-Text Retrieval
Zero-shot Learning
Few-shot Learning
Use Cases
Video Analysis
Video Classification
Classify video content to identify actions or scenes in videos.
Achieves 81.1% top-1 accuracy on the Kinetics-400 dataset.
Video-Text Retrieval
Retrieve relevant videos from text queries, or rank candidate text descriptions against a given video.
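A minimal usage sketch with the Hugging Face `transformers` X-CLIP API, assuming the `microsoft/xclip-base-patch32-16-frames` checkpoint; the random frames below stand in for a real 16-frame clip sampled from a video:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoProcessor

ckpt = "microsoft/xclip-base-patch32-16-frames"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# A real pipeline would sample 16 frames from a video file;
# random uint8 frames are used here purely for illustration.
video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))
labels = ["playing basketball", "cooking", "playing guitar"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video has shape (num_videos, num_texts);
# softmax over texts gives per-label probabilities for the clip.
probs = outputs.logits_per_video.softmax(dim=1)
```

The same `outputs` also expose `logits_per_text`, which scores videos against each text query and supports the text-to-video retrieval direction.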