X-CLIP Base Patch16
X-CLIP is an extension of CLIP for general video-language understanding. It is trained via contrastive learning on (video, text) pairs and is suitable for tasks such as video classification and video-text retrieval.
Downloads: 1,647
Release Time: 9/7/2022
Model Overview
The X-CLIP model (base-sized, 16x16 patch resolution) was trained in a fully supervised manner on Kinetics-400 and is suitable for zero-shot, few-shot, or fully supervised video classification tasks.
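Below is a minimal zero-shot classification sketch using the Hugging Face transformers library. The checkpoint id microsoft/xclip-base-patch16, the 8-frame input length, and the label prompts are assumptions made for illustration; the video here is random dummy data standing in for real sampled frames.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# Assumed checkpoint id for this model on the Hugging Face Hub.
ckpt = "microsoft/xclip-base-patch16"
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

# Dummy video: 8 RGB frames (assumed frame count for this checkpoint).
# In practice, sample 8 frames evenly from a real clip.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

labels = ["playing guitar", "riding a bike", "cooking"]
inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity between the video and each candidate text prompt.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```

Because classification is phrased as matching the video against free-form text prompts, the label set can be changed at inference time without retraining, which is what enables the zero-shot and few-shot settings listed below.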
Model Features
Video-language understanding
Trained via contrastive learning on (video, text) pairs, supporting video-text matching and joint video-language understanding.
Multi-task support
Suitable for zero-shot, few-shot, or fully supervised video classification as well as video-text retrieval tasks.
High performance
Achieves top-1 accuracy of 83.8% and top-5 accuracy of 95.7% on the Kinetics-400 dataset.
Model Capabilities
Video classification
Video-text retrieval
Zero-shot learning
Few-shot learning
Use Cases
Video analysis
Video content classification
Classify video content to recognize actions or scenes in videos.
Achieves 83.8% top-1 accuracy on the Kinetics-400 dataset.
Video-text retrieval
Retrieve relevant videos based on text descriptions or generate descriptive text from video content.
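As a sketch of how text-to-video retrieval could be built on the model's shared embedding space, the snippet below pre-encodes a small video index and ranks it against a text query by cosine similarity. It assumes the same microsoft/xclip-base-patch16 checkpoint and uses the get_video_features / get_text_features helpers from the transformers X-CLIP implementation; the two indexed videos are dummy data.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

ckpt = "microsoft/xclip-base-patch16"  # assumed checkpoint id
processor = XCLIPProcessor.from_pretrained(ckpt)
model = XCLIPModel.from_pretrained(ckpt)

def encode_video(frames):
    """Embed one video (list of 8 RGB frames) into the shared space."""
    inputs = processor(videos=frames, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_video_features(pixel_values=inputs["pixel_values"])
    return emb / emb.norm(dim=-1, keepdim=True)

def encode_text(query):
    """Embed a text query into the shared space."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Dummy index of two videos (8 random frames each); replace with real clips.
videos = [list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
          for _ in range(2)]
video_index = torch.cat([encode_video(v) for v in videos])

query_emb = encode_text("a person playing guitar")
scores = (query_emb @ video_index.T).squeeze(0)  # cosine similarities
ranking = scores.argsort(descending=True).tolist()
print("videos ranked by relevance to the query:", ranking)
```

In a real system the video embeddings would be computed once and stored, so a new text query only requires one text encoding plus a similarity lookup.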