SpaceTimeGPT Open-Source Video Description Generation Model - Freely Achieve Spatio-Temporal Reasoning and Video Event Description

SpaceTimeGPT

Developed by Neleac

SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning. It analyzes video frames and generates a sentence describing the events in the video.

Video-to-Text

Transformers

English

#Video Autoregressive Description #Spatiotemporal Joint Modeling #Multi-frame Visual Encoding

Downloads 2,877

Release Time: 4/21/2023

Model Overview

This model combines a visual encoder and a text decoder. It samples frames from a video and generates a textual description of what they show, which makes it suitable for video captioning tasks.

Model Features

Spatiotemporal Reasoning

Analyzes both spatial and temporal information in a video to describe what happens in it.

Pretrained Model Integration

Combines the pretrained TimeSformer video classification model with the pretrained GPT-2 text generation model.

Multi-frame Analysis

Samples eight frames from each video and analyzes them together, so the description draws on the whole clip rather than a single frame.

Model Capabilities

Video Caption Generation

Video Content Understanding

Spatiotemporal Information Processing

Use Cases

Video Content Analysis

Automatic Video Captioning

Automatically generates descriptive captions for videos to improve accessibility.

Generated descriptions reflect the content of the video.

Video Content Understanding

Analyzes video content to extract key events and actions.

Identifies the main activities and scenes in a video.

🚀 SpaceTimeGPT - Video Captioning Model

SpaceTimeGPT is a video description generation model that can perform spatial and temporal reasoning. It samples and analyzes eight frames from a given video, then autoregressively generates a sentence describing the events in the video.
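The eight-frame sampling described above can be illustrated with a small, dependency-free sketch. The helper name `sample_frame_indices` is hypothetical and not part of the model's code; it assumes evenly spaced sampling that starts at frame 0 and excludes the end point, matching the `np.linspace(..., endpoint=False)` call in the usage example.

```python
def sample_frame_indices(total_frames, num_samples=8):
    """Return evenly spaced frame indices, starting at frame 0.

    Hypothetical helper for illustration only; it mirrors
    np.linspace(0, total_frames, num=num_samples, endpoint=False)
    followed by truncation to integers.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    step = total_frames / num_samples
    # int() truncates toward zero, like astype(np.int64) on non-negative values
    return [int(i * step) for i in range(num_samples)]


print(sample_frame_indices(240))  # [0, 30, 60, 90, 120, 150, 180, 210]
print(sample_frame_indices(5))    # [0, 0, 1, 1, 2, 3, 3, 4]
```

Note that a video with fewer frames than samples produces repeated indices. Code that stores the indices in a set, as the usage example does, will then collect fewer than eight distinct frames.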

📚 Documentation

Dataset

HuggingFaceM4/vatex

Language

English

Metrics

BLEU
METEOR
ROUGE
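To give a feel for how an overlap metric of this kind scores a caption, here is a simplified sketch of ROUGE-L, which is based on the longest common subsequence (LCS) between a candidate and a reference. It assumes lowercased whitespace tokenization, a single reference, and a plain F1 combination of precision and recall. The function names are hypothetical, and this is not the scorer used to evaluate the model.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between one candidate and one reference caption."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a man is dancing on a stage", "a man dances on a stage")
print(round(score, 4))  # 0.7692
```

Published scores are normally computed with an established evaluation toolkit over several references per video, so values from this sketch are not comparable to reported numbers.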

Pipeline Tag

Video-text-to-text

Inference

Enabled

Model Index

Property	Details
Model Name	SpaceTimeGPT
Task Type	Video-captioning
Dataset Type	Video-captioning
Dataset Name	VATEX
Metric Name	CIDEr
Metric Type	Image-captioning
Metric Value	67.3
Verified	False

✨ Features

Architecture and Training

Vision Encoder: timesformer-base-finetuned-k600
Text Decoder: gpt2

The encoder and decoder are initialized with pretrained weights for video classification and sentence completion respectively. Encoder-decoder cross attention is used to integrate the visual and linguistic domains. The model is fine-tuned end-to-end for the video captioning task. For more details, please refer to the GitHub repository.
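The cross-attention step can be pictured with a minimal, dependency-free sketch: each text-token state acts as a query, and the visual-token states act as keys and values. This is an illustration under simplifying assumptions (one head, no learned query/key/value projections, plain Python lists). The function names are hypothetical, and it is not the implementation in the model or in the Transformers library.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_states):
    """Single-head scaled dot-product cross attention.

    text_states:   query vectors, one per text token (decoder side)
    visual_states: key/value vectors, one per visual token (encoder side)
    Learned projections and multiple heads are omitted for brevity.
    """
    dim = len(visual_states[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in text_states:
        # similarity of this text token to every visual token
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in visual_states]
        weights = softmax(scores)
        # weighted average of the visual states
        mixed = [sum(w * value[d] for w, value in zip(weights, visual_states))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


visual = [[1.0, 0.0], [0.0, 1.0]]   # two toy visual-token states
text = [[10.0, 0.0], [0.0, 0.0]]    # two toy text-token states
out = cross_attention(text, visual)
# out[0] is dominated by the first visual state; out[1] is an even mix
```

Each output vector is a weighted average of the visual states, which is how visual information enters the decoder's representation of each text token.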

💻 Usage Examples

Basic Usage

import av
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# load pretrained processor, tokenizer, and model
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = VisionEncoderDecoderModel.from_pretrained("Neleac/timesformer-gpt2-video-captioning").to(device)

# load video
video_path = "never_gonna_give_you_up.mp4"
container = av.open(video_path)

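# extract evenly spaced frames from video
# note: some containers report 0 frames in their stream metadata; seg_len must be positive here
seg_len = container.streams.video[0].frames
clip_len = model.config.encoder.num_frames
indices = set(np.linspace(0, seg_len, num=clip_len, endpoint=False).astype(np.int64))
frames = []
container.seek(0)
for i, frame in enumerate(container.decode(video=0)):
    if i in indices:
        frames.append(frame.to_ndarray(format="rgb24"))
        if len(frames) == len(indices):
            break  # all sampled frames collected; skip decoding the rest
container.close()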

# generate caption
gen_kwargs = {
    "min_length": 10,
    "max_length": 20,
    "num_beams": 8,
}
pixel_values = image_processor(frames, return_tensors="pt").pixel_values.to(device)
tokens = model.generate(pixel_values, **gen_kwargs)
caption = tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]
print(caption) # A man and a woman are dancing on a stage in front of a mirror.

👨‍💻 Author Information

👾 Discord
🐙 GitHub
🤝 LinkedIn
