Open-source X-CLIP Model - For General Video-Language Understanding and Enhanced Video-Text Interaction

Xclip Base Patch16 Hmdb 4 Shot

Developed by Microsoft

X-CLIP is a minimalist extension of CLIP for general video-language understanding, trained via contrastive learning with (video, text) pairs.

Video-to-Text

Transformers

Language: English

Open Source License: MIT

#Video-text contrastive learning #Few-shot video classification #Action recognition

Downloads 22

Release Time: 9/7/2022

Model Overview

This is a base-sized X-CLIP model with a patch resolution of 16, trained in a few-shot fashion (K=4) on the HMDB-51 dataset. It is suited to video classification tasks.
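As a rough illustration of how such a model could be used for classification, the sketch below scores a clip against a list of candidate text labels with the Hugging Face Transformers X-CLIP classes. The model identifier "microsoft/xclip-base-patch16-hmdb-4-shot" is an assumption inferred from the model name and developer above, `rank_labels` is a hypothetical helper, and `frames` is assumed to be a sequence of 32 RGB frames (for example NumPy arrays of shape height x width x 3).

```python
def rank_labels(probs, labels):
    """Pair each label with its probability, highest probability first."""
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


def classify_video(frames, labels,
                   model_id="microsoft/xclip-base-patch16-hmdb-4-shot"):
    """Score one clip against candidate text labels.

    frames: sequence of 32 RGB frames (assumed input format).
    labels: list of label strings, e.g. ["climb", "wave", "run"].
    model_id: assumed Hub identifier, not confirmed by this page.
    """
    # Imported lazily so rank_labels can be used without these packages.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)

    # The processor resizes and normalizes frames and tokenizes the labels.
    inputs = processor(text=labels, videos=list(frames),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has one row per clip and one column per label.
    probs = outputs.logits_per_video.softmax(dim=1)[0].tolist()
    return rank_labels(probs, labels)
```

Calling `classify_video(frames, ["climb", "wave", "run"])` would return the labels sorted by predicted probability; the label strings here are only placeholders.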

Model Features

Few-shot learning capability

The model is trained in a few-shot setting (K=4) on the HMDB-51 dataset, that is, with only 4 examples per class.

Video-text contrastive learning

Uses contrastive learning with (video, text) pairs to enhance the model's understanding of video content.
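To make the idea of contrastive learning on (video, text) pairs concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss. This page does not spell out the exact objective used by X-CLIP, so the symmetric form, the function names, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric loss: video i should match text i and vice versa.

    The temperature value is an illustrative assumption.
    """
    n = len(video_embs)
    sims = [[cosine(v, t) / temperature for t in text_embs]
            for v in video_embs]
    # Video-to-text: each row should peak on its diagonal entry.
    v2t = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # Text-to-video: each column should peak on its diagonal entry.
    t2v = sum(cross_entropy([sims[j][i] for j in range(n)], i)
              for i in range(n)) / n
    return (v2t + t2v) / 2
```

With correctly paired embeddings the loss is close to zero, and it grows when videos are paired with the wrong captions.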

Efficient video processing

Processes 32 frames per video at 224x224 resolution, balancing computational efficiency and model performance.
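A fixed budget of 32 frames means longer videos have to be subsampled before they reach the model. The sketch below picks 32 evenly spaced frame indices; the sampling strategy and the name `sample_frame_indices` are assumptions for illustration, since this page does not describe how frames are selected.

```python
def sample_frame_indices(total_frames, num_frames=32):
    """Return num_frames indices spread evenly over a video.

    Takes the midpoint of each of num_frames equal segments. Short
    videos yield repeated indices, so the output length is constant.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    step = total_frames / num_frames
    return [min(total_frames - 1, int((i + 0.5) * step))
            for i in range(num_frames)]
```

The selected frames would then be resized to 224x224 before being passed to the model.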

Model Capabilities

Video classification

Video-text matching

Few-shot learning

Use Cases

Video understanding

Human action recognition

Recognizing human action categories in videos

Achieves 57.3% top-1 accuracy on the HMDB-51 dataset
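Top-1 accuracy is the fraction of clips whose highest-scoring class matches the ground-truth class. A minimal sketch of that computation follows; the scores in the usage example are made-up toy values, not model outputs.

```python
def top1_accuracy(scores, true_labels):
    """Fraction of rows whose arg-max index equals the true label.

    scores: one list of per-class scores per clip.
    true_labels: the correct class index for each clip.
    """
    correct = 0
    for row, truth in zip(scores, true_labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == truth:
            correct += 1
    return correct / len(true_labels)


# Toy example: 3 of 4 clips are classified correctly.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_labels))  # 0.75
```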

Video retrieval

Text-based video retrieval

Retrieving relevant video clips based on text descriptions
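Text-based retrieval can be sketched as ranking clips by the cosine similarity between a text embedding and each video embedding. In the example below the embeddings are plain lists of floats and `rank_videos` is a hypothetical helper; in practice the vectors would come from the model's text and video encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_videos(query_emb, video_embs):
    """Rank clips by cosine similarity to the query embedding.

    query_emb: embedding of the text description.
    video_embs: dict mapping a clip name to its embedding.
    Returns (name, similarity) pairs, most similar first.
    """
    query = normalize(query_emb)
    scored = []
    for name, emb in video_embs.items():
        unit = normalize(emb)
        scored.append((name, sum(q * v for q, v in zip(query, unit))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Keeping only the first few entries of the returned list gives a top-k retrieval result.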

Property	Details
Model Type	X-CLIP (base-sized, patch resolution of 16)
Training Data	HMDB-51


