X-CLIP Open-Source Model - A Free Tool for Efficient General Video and Language Understanding

Xclip Base Patch16 16 Frames

Developed by Microsoft

X-CLIP is a minimalist extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

Text-to-Video

Transformers

English

Open Source License: MIT

#Video-Text Contrastive Learning #Zero-Shot Video Classification #Multimodal Video Understanding

Downloads 1,034

Release Time: 9/7/2022

Model Overview

This model can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.

Model Features

Video-Language Understanding

Trained via contrastive learning on (video, text) pairs, supporting video-text matching.

Multi-Task Support

Can be used for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.

Efficient Training

Trained on 16 frames per video at a resolution of 224x224.

Model Capabilities

Video Classification

Video-Text Retrieval

Zero-Shot Learning

Few-Shot Learning

Use Cases

Video Analysis

Video Classification

Classify video content, such as action recognition, scene recognition, etc.

Achieves 84.7% top-1 accuracy and 96.8% top-5 accuracy on the Kinetics-400 dataset.

Video-Text Retrieval

Retrieve relevant videos from a text description, or rank candidate text descriptions by how well they match a given video.

🚀 X-CLIP (base-sized model)

X-CLIP is a minimal extension of CLIP for general video-language understanding. It can be used for tasks such as zero-shot, few-shot, or fully supervised video classification and video-text retrieval.

🚀 Quick Start

The X-CLIP model (base-sized, patch resolution of 16) was trained in a fully supervised fashion on Kinetics-400. It was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in [this repository](https://github.com/microsoft/VideoX/tree/master/X-CLIP). This model was trained using 16 frames per video, at a resolution of 224x224.
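Because the model expects 16 frames per clip, a longer video has to be reduced to 16 frames before it is passed in. The sketch below shows one common way to do that: split the clip into 16 equal segments and take the middle frame of each. The function name `sample_frame_indices` and the uniform sampling strategy are assumptions made for illustration; the text above does not say how frames were selected during training.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 16) -> List[int]:
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames.

    Hypothetical helper: uniform sampling is an assumption, not the
    documented X-CLIP training procedure.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Length of each segment, in frames (may be fractional).
    segment = total_frames / num_frames
    # Take the frame at the middle of every segment, clamped to the last frame.
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


indices = sample_frame_indices(total_frames=300)
print(indices)
```

If a clip has fewer frames than requested, this approach repeats frames rather than failing, which keeps the input length fixed at 16.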

Disclaimer: The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

X-CLIP is a minimal extension of CLIP for general video-language understanding.
The model is trained in a contrastive way on (video, text) pairs, allowing it to be used for tasks such as zero-shot, few-shot, or fully supervised video classification and video-text retrieval.

📚 Documentation

Intended uses & limitations

You can use the raw model to determine how well a text matches a given video. See the model hub for versions fine-tuned on a task that interests you.
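To make "how well a text matches a video" concrete, here is a minimal sketch of the scoring step used by CLIP-style contrastive models: cosine similarity between one video embedding and several text embeddings, followed by a softmax. It does not call the real model. The 3-dimensional embeddings, the function names, and the fixed `logit_scale` of 100 are all illustrative assumptions; in the actual model the embeddings come from the video and text encoders and the scale is a learned parameter.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def match_probabilities(
    video_emb: Sequence[float],
    text_embs: Sequence[Sequence[float]],
    logit_scale: float = 100.0,  # assumed constant, not the model's learned value
) -> List[float]:
    """Return a softmax over scaled cosine similarities, one value per text."""
    v = l2_normalize(video_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(v, l2_normalize(t)))
        for t in text_embs
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings, made up for this example.
labels = ["playing guitar", "cooking", "swimming"]
video_embedding = [0.9, 0.1, 0.2]
text_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]

probs = match_probabilities(video_embedding, text_embeddings)
best = labels[probs.index(max(probs))]
print(best, probs)
```

For zero-shot classification, the candidate texts are the class names (or short prompts built from them), and the class with the highest probability is the prediction.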

How to use

For code examples, refer to the X-CLIP page of the Hugging Face Transformers documentation.

Training data

This model was trained on Kinetics-400.

Preprocessing

The exact details of preprocessing during training can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L247). The exact details of preprocessing during validation can be found [here](https://github.com/microsoft/VideoX/blob/40f6d177e0a057a50ac69ac1de6b5938fd268601/X-CLIP/datasets/build.py#L285). During validation, the shorter edge of each frame is resized, after which the frame is center-cropped to a fixed resolution (such as 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
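The validation steps described above (resize the shorter edge, center crop, normalize) can be sketched with plain arithmetic. This is an illustration, not the repository's code: the helper names are hypothetical, the resize target of 224 for the shorter edge is an assumption (the text only gives the crop size as an example), and the mean and standard deviation are the commonly used ImageNet values for RGB inputs scaled to the range 0 to 1.

```python
from typing import Tuple

# Commonly used ImageNet statistics for RGB channels in [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_shape(height: int, width: int, shorter_edge: int = 224) -> Tuple[int, int]:
    """Return (height, width) after scaling so the shorter edge equals `shorter_edge`."""
    scale = shorter_edge / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height: int, width: int, crop: int = 224) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a centered `crop` x `crop` window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Map one 0-255 RGB pixel to mean/std-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# Example: a 360x640 (height x width) frame.
new_h, new_w = resized_shape(360, 640)
box = center_crop_box(new_h, new_w)
print((new_h, new_w), box, normalize_pixel((255, 255, 255)))
```

In practice the same three steps are applied to every one of the sampled frames, so all frames in a clip end up with the same 224x224 shape.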

Evaluation results

Property	Details
Model Type	nielsr/xclip-base-patch16-16-frames
Training Data	Kinetics-400
Top-1 Accuracy	84.7%
Top-5 Accuracy	96.8%
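For readers unfamiliar with the metrics in the table: top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below computes it on a tiny made-up score matrix; the function name and the data are hypothetical and unrelated to the reported Kinetics-400 numbers.

```python
from typing import List, Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k: List[int] = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy scores for 4 videos over 3 classes (illustrative only).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
]
toy_labels = [0, 1, 2, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(top1, top2)
```

Top-5 accuracy is always at least as high as top-1, since every top-1 hit is also a top-5 hit.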

📄 License

This model is released under the MIT license.
