xclip-large-patch14-16-frames: Open-Source Video-Language Model for Video Classification and Video-Text Retrieval

Xclip Large Patch14 16 Frames

Developed by Microsoft

X-CLIP is an extension of CLIP for general video-language understanding. It is trained with contrastive learning and supports video classification and video-text retrieval.

Text-to-Video

Transformers

Language: English

Open Source License: MIT

Tags: #Video-Text Contrastive Learning #Zero-shot Video Classification #High-precision Action Recognition

Downloads 678

Release Time: 9/7/2022

Model Overview

The X-CLIP model (large-sized, patch resolution of 14) was trained in a fully supervised fashion on Kinetics-400. It supports zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval.

Model Features

Video-Language Contrastive Learning

Trained via contrastive learning with (video, text) pairs, supporting video-text matching tasks.

High-Resolution Processing

Trained with 16 frames per video at a resolution of 336x336, which helps preserve fine visual detail.

General Video Understanding

Applicable to various video understanding tasks, including classification and retrieval.

Model Capabilities

Video Classification

Video-Text Retrieval

Zero-shot Learning

Few-shot Learning

Use Cases

Video Content Analysis

Video Classification

Classify video content, for example by recognizing actions or scenes.

Top-1 accuracy of 87.7% and top-5 accuracy of 97.4% on Kinetics-400.

Video-Text Retrieval

Retrieve relevant video clips based on text descriptions.

🚀 X-CLIP (large-sized model)

X-CLIP is a model designed for general video-language understanding. It expands upon the CLIP model and is trained on the Kinetics-400 dataset, enabling it to handle tasks such as video classification and video-text retrieval.

🚀 Quick Start

For code examples, see the documentation.
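As a rough starting point, the sketch below follows the usage pattern shown in the Hugging Face Transformers documentation for X-CLIP. It assumes that `transformers` and `torch` are installed, that the checkpoint is published on the Hub as `microsoft/xclip-large-patch14-16-frames` (an assumption; this card only names the developer and the model), and that the video has already been decoded into a list of RGB frames. The names `uniform_frame_indices` and `classify_video` are illustrative and not part of any library.

```python
def uniform_frame_indices(total_frames, num_frames=16):
    """Pick num_frames indices spread evenly over a decoded video.

    Each index is the midpoint of one of num_frames equal segments.
    If the video has fewer frames than num_frames, indices repeat.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    return [(2 * i + 1) * total_frames // (2 * num_frames)
            for i in range(num_frames)]


def classify_video(frames, candidate_labels,
                   checkpoint="microsoft/xclip-large-patch14-16-frames"):
    """Score candidate text labels against one video.

    frames: list of 16 RGB frames (for example H x W x 3 uint8 arrays).
    checkpoint: assumed Hub ID; replace it if your copy lives elsewhere.
    """
    # Imported lazily so the sampling helper works without these packages.
    import torch
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)

    inputs = processor(text=candidate_labels, videos=list(frames),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has shape (num_videos, num_texts).
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return dict(zip(candidate_labels, probs.tolist()))
```

A typical call would first select 16 frames with `uniform_frame_indices(len(all_frames))` and then pass those frames together with a few label strings to `classify_video`. The first call downloads the checkpoint, which is large.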

✨ Features

General Video-Language Understanding: X-CLIP is a minimal extension of CLIP, trained in a contrastive way on (video, text) pairs. This allows it to be used for zero-shot, few-shot, or fully supervised video classification and video-text retrieval.
Trained on Kinetics-400: The model was trained in a fully supervised fashion on the Kinetics-400 dataset.

📚 Documentation

Model description

X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs.

[Figure: X-CLIP architecture]

This allows the model to be used for tasks like zero-shot, few-shot, or fully supervised video classification and video-text retrieval.

Intended uses & limitations

You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you.
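To make "how well text goes with a given video" concrete, here is a minimal sketch of the usual CLIP-style scoring step: cosine similarity between one video embedding and several text embeddings, scaled and passed through a softmax. It works on plain Python lists standing in for embeddings. The function names and the default `logit_scale` of 100.0 are assumptions for illustration, not values taken from this model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_probabilities(video_embedding, text_embeddings, logit_scale=100.0):
    """Turn video-text similarities into a probability per text.

    logit_scale is a hypothetical temperature; a real model supplies
    its own learned value.
    """
    logits = [logit_scale * cosine_similarity(video_embedding, t)
              for t in text_embeddings]
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The text with the highest probability is the best match for the video. For zero-shot classification, the texts would be one prompt per class name.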

Training data

This model was trained on Kinetics-400.

Preprocessing

The exact details of preprocessing during training can be found here.

The exact details of preprocessing during validation can be found here.

During validation, the shorter edge of each frame is resized, after which the frame is center-cropped to a fixed-size resolution (for example, 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
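The validation steps above can be sketched in plain Python. This is only an illustration of the geometry and the per-pixel arithmetic, not the actual preprocessing code, which is not reproduced in this card. The function names are hypothetical, and the mean and standard deviation are the commonly used ImageNet values for pixels scaled to the range 0 to 1.

```python
# Commonly used ImageNet statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_size(height, width, shorter_edge):
    """Frame size after scaling so the shorter edge equals shorter_edge."""
    scale = shorter_edge / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height, width, crop):
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# Example: a 480x640 frame prepared for a 336x336 input.
new_h, new_w = resized_size(480, 640, 336)
box = center_crop_box(new_h, new_w, 336)
```

In practice a library image processor performs these steps on whole frames at once; the arithmetic per frame is the same.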

Evaluation results

This model achieves a top-1 accuracy of 87.7% and a top-5 accuracy of 97.4%.
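For readers unfamiliar with these metrics, the following sketch shows how top-k accuracy is typically computed from per-class scores. The function name and the toy scores are hypothetical and unrelated to the reported results.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: one list of per-class scores per sample.
    labels: the true class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data: three samples, three classes.
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 2]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

Top-5 accuracy is the same computation with `k=5`, so it is always at least as high as top-1.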

📄 License

This model is released under the MIT license.

📦 Model Information

Property	Details
Model Name	nielsr/xclip-large-patch14-16-frames
Task	Video Classification
Dataset	Kinetics-400
Top-1 Accuracy	87.7%
Top-5 Accuracy	97.4%

⚠️ Important Note

The team releasing X-CLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.
