X-CLIP Open-Source Model - A General Video-Language Understanding Tool for Video Classification and Retrieval

Xclip Base Patch16 Kinetics 600 16 Frames

Developed by Microsoft

X-CLIP is an extension of CLIP for general video-language understanding, supporting zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.

Text-to-Video

Transformers

Language: English
Open Source License: MIT
Tags: #Video-Text Contrastive Learning #Zero-Shot Video Classification #Multimodal Video Understanding

Downloads 393

Release Time: 9/8/2022

Model Overview

This X-CLIP model (base size, 16x16 patch resolution) was trained in a fully supervised fashion on the Kinetics-600 dataset. It is intended primarily for video classification and video-text retrieval.

Model Features

Video-Language Understanding

Trained contrastively on (video, text) pairs to support video-text matching tasks.
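
To make "trained contrastively" concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss over a batch of (video, text) pairs. It is an illustration, not necessarily the exact objective used to train this checkpoint; the function names and the temperature of 0.07 are assumptions made for the example.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(similarity, temperature=0.07):
    """similarity[i][j] is the cosine similarity of video i and text j.

    Matching pairs sit on the diagonal. The loss averages the
    video-to-text and text-to-video cross-entropies.
    """
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    video_to_text = sum(softmax_cross_entropy(scaled[i], i) for i in range(n)) / n
    columns = [[scaled[i][j] for i in range(n)] for j in range(n)]
    text_to_video = sum(softmax_cross_entropy(columns[j], j) for j in range(n)) / n
    return (video_to_text + text_to_video) / 2


# Hypothetical similarity matrices for a batch of two pairs.
aligned = [[0.9, 0.1], [0.2, 0.8]]    # true pairs score highest
shuffled = [[0.1, 0.9], [0.8, 0.2]]   # true pairs score lowest
print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(shuffled))
```

The loss is low when each video is most similar to its own text, and high otherwise, which is what pushes matching pairs together during training.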

High Accuracy

Achieves 85.8% top-1 accuracy and 97.3% top-5 accuracy on the Kinetics-600 dataset.
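
For readers unfamiliar with the metric, the following sketch shows how top-1 and top-5 style accuracy figures are typically computed from per-class scores. The scores and labels are made up for illustration and are not model output; `top_k_accuracy` is a hypothetical helper name.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up scores for 4 clips over 3 classes (not real model output).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
]
labels = [0, 1, 1, 2]
print(top_k_accuracy(scores, labels, k=1))  # 0.25
print(top_k_accuracy(scores, labels, k=2))  # 1.0
```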

Multi-Task Support

Can be used for zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.

Model Capabilities

Video Classification

Video-Text Retrieval

Zero-Shot Learning

Few-Shot Learning

Use Cases

Video Analysis

Video Classification

Classify video content to recognize actions or scenes in videos.

Reported result: 85.8% top-1 accuracy, 97.3% top-5 accuracy.
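
A classification call might look like the sketch below, which uses the `XCLIPProcessor` and `XCLIPModel` classes from the Hugging Face Transformers library. The Hub identifier `microsoft/xclip-base-patch16-kinetics-600-16-frames` is an assumption inferred from the model name on this page, and `sample_frame_indices` and `classify_video` are hypothetical helper names. Decoding the video into frames is left to the caller.

```python
def sample_frame_indices(num_frames, total_frames):
    """Pick num_frames indices spread evenly across a clip of total_frames frames."""
    if total_frames < 1 or num_frames < 1:
        raise ValueError("num_frames and total_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def classify_video(frames, candidate_labels,
                   model_id="microsoft/xclip-base-patch16-kinetics-600-16-frames"):
    """Score a clip against free-text labels; returns {label: probability}.

    frames: list of 16 RGB frames as HxWx3 uint8 NumPy arrays.
    model_id: assumed Hub identifier, not confirmed by this page.
    """
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)
    inputs = processor(text=candidate_labels, videos=list(frames),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_video has shape (num_videos, num_texts); one video here.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return dict(zip(candidate_labels, probs.tolist()))


# Indices of the 16 frames to decode from a 300-frame clip.
print(sample_frame_indices(16, 300))
```

Because the labels are plain text, the same call covers zero-shot use: pass label strings the model was never trained on and compare the returned probabilities.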

Video-Text Retrieval

Retrieve relevant videos from a text description, or find the text descriptions that best match a given video.
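
Text-to-video retrieval with a contrastive model usually reduces to ranking precomputed video embeddings by their similarity to a text embedding. The sketch below shows that ranking step only, with toy 3-dimensional vectors standing in for real encoder outputs; `cosine_similarity` and `rank_videos` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_videos(text_embedding, video_embeddings):
    """Return (video_id, score) pairs sorted from best to worst match."""
    scored = [(vid, cosine_similarity(text_embedding, emb))
              for vid, emb in video_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; real ones would come from the model's video and text encoders.
library = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.6, 0.6, 0.1],
}
query = [1.0, 0.0, 0.0]
for video_id, score in rank_videos(query, library):
    print(video_id, round(score, 3))
```

Swapping the roles (one video embedding as the query, many text embeddings as the library) gives video-to-text retrieval with the same code.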

Property	Details
Model Type	X-CLIP (base-sized model)
Training Data	Kinetics-600
Top-1 Accuracy	85.8%
Top-5 Accuracy	97.3%

