ShareCaptioner-Video Open-Source Video Description Generator - Generate Descriptions for Videos of Different Formats for Free

ShareCaptioner-Video

Developed by Lin-Chen

An open-source video caption generator fine-tuned on GPT4V-annotated data, supporting videos of various durations, aspect ratios, and resolutions

Video-to-Text

Transformers

#Video Dense Captioning #Sliding Window Difference #GPT4V-Assisted Annotation

Downloads 264

Release Time: 6/6/2024

Model Overview

ShareCaptioner-Video is an open-source video caption generator fine-tuned on the ShareGPT4Video detailed description dataset annotated with GPT4V assistance. It supports four main functions: rapid caption generation, sliding window captioning, segment summarization, and prompt rewriting.

Model Features

Rapid Caption Generation

Generates video captions directly in image grid format, providing ultra-fast generation for short videos
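As a rough illustration of the image-grid input, the sketch below stacks keyframes top to bottom into one tall image, following the model card's description of keyframes concatenated into a vertically elongated image. It uses plain Python lists of RGB tuples so that it runs without dependencies. The function names (`stack_keyframes_vertically`, `solid_frame`) are hypothetical and are not part of the ShareCaptioner-Video code, and how the real pipeline selects or resizes keyframes is not covered here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Frame = List[List[Pixel]]      # a frame is a list of pixel rows


def stack_keyframes_vertically(frames: Sequence[Frame]) -> Frame:
    """Concatenate keyframes top to bottom into one tall image (hypothetical helper)."""
    if not frames or not frames[0]:
        raise ValueError("at least one non-empty keyframe is required")
    width = len(frames[0][0])
    grid: Frame = []
    for index, frame in enumerate(frames):
        for row in frame:
            # All frames must share one width, otherwise the grid is ragged.
            if len(row) != width:
                raise ValueError(
                    f"frame {index} has a row of width {len(row)}, expected {width}"
                )
            grid.append(list(row))
    return grid


def solid_frame(height: int, width: int, color: Pixel) -> Frame:
    """Build a single-color test frame (illustration only)."""
    return [[color] * width for _ in range(height)]


keyframes = [
    solid_frame(2, 3, (255, 0, 0)),
    solid_frame(2, 3, (0, 255, 0)),
    solid_frame(2, 3, (0, 0, 255)),
]
grid = stack_keyframes_vertically(keyframes)
print(len(grid), len(grid[0]))  # height and width of the stacked image
```

With NumPy arrays of identical width, the same stacking can be written as `np.concatenate(frames, axis=0)`.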

Sliding Window Captioning

Supports streaming caption generation in differential sliding window format, delivering high-quality captions for long videos

Segment Summarization

Quickly summarizes video segments or previously processed sliding window captions without re-processing frame data

Prompt Rewriting

Rewrites user prompts for a given video-generation domain so that, at inference time, text-to-video models receive prompts in the same format as the captions they were trained on

Model Capabilities

Video Caption Generation

Streaming Caption for Long Videos

Video Segment Summarization

Prompt Optimization

Use Cases

Video Content Understanding

Short Video Caption Generation

Quickly generates detailed captions for short videos

Improves efficiency in short video content understanding

Long Video Content Analysis

Analyzes long video content through sliding window technology

Achieves refined understanding of long videos

Video Generation Assistance

Prompt Optimization

Optimizes input prompts for text-to-video models

Enhances consistency between generated videos and text descriptions

🚀 ShareCaptioner-Video Model Card

ShareCaptioner-Video is an open-source captioner, based on the InternLM-Xcomposer2-4KHD model, that generates high-quality video captions.

🚀 Quick Start

This section provides an overview of the ShareCaptioner-Video model, including its details, intended use, finetuning dataset, and related paper.

✨ Features

ShareCaptioner-Video is an open-source captioner fine-tuned on GPT4V-assisted ShareGPT4Video detailed caption data, supporting videos of various durations, aspect ratios, and resolutions. It is based on the InternLM-Xcomposer2-4KHD model and serves four roles:

Fast Captioning: The model uses an image-grid format for direct video captioning, offering rapid generation speeds suitable for short videos. In practice, all the keyframes of a video are concatenated into a vertically elongated image, and the model is trained on a caption task.
Sliding Captioning: The model supports streaming captioning in a differential sliding-window format, producing high-quality captions suitable for long videos. Two adjacent keyframes and the previous differential caption are taken as input, and the model is trained to describe the events between them.
Clip Summarizing: The model can quickly summarize any clip from ShareGPT4Video, or from videos that have undergone the differential sliding-window captioning process, without re-processing frames. All the differential descriptions are used as input, and the output is the video caption.
Prompt Re-Captioning: The model can rephrase prompts from users who focus on specific video-generation areas, so that text-to-video models (T2VMs) trained on high-quality video-caption data receive inputs at inference time in the same format as their training captions. In practice, GPT-4 is used to generate Sora-style prompts for dense captions, and the re-captioning task is trained in reverse: the generated prompt is the input and the dense caption is the training target.
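The Sliding Captioning and Clip Summarizing roles can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the project's implementation: the model calls are replaced by stub functions, keyframes are represented by string identifiers, and every name here (`sliding_window_captions`, `summarize_clip`, `fake_describe_change`, `fake_summarize`) is hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: (previous_frame, current_frame, previous_caption) -> caption
DiffCaptioner = Callable[[str, str, Optional[str]], str]
Summarizer = Callable[[Sequence[str]], str]


def sliding_window_captions(
    keyframes: Sequence[str], describe_change: DiffCaptioner
) -> List[str]:
    """Caption each pair of adjacent keyframes, feeding back the previous caption."""
    captions: List[str] = []
    previous_caption: Optional[str] = None
    for previous_frame, current_frame in zip(keyframes, keyframes[1:]):
        caption = describe_change(previous_frame, current_frame, previous_caption)
        captions.append(caption)
        previous_caption = caption  # context for the next window
    return captions


def summarize_clip(
    diff_captions: Sequence[str], start: int, end: int, summarize: Summarizer
) -> str:
    """Summarize a clip from its differential captions only; no frames are re-read."""
    return summarize(diff_captions[start:end])


# Stubs standing in for the model (assumptions for illustration only).
def fake_describe_change(prev: str, cur: str, prev_caption: Optional[str]) -> str:
    context = "start" if prev_caption is None else "continued"
    return f"{prev}->{cur} ({context})"


def fake_summarize(captions: Sequence[str]) -> str:
    return " | ".join(captions)


frames = ["f0", "f1", "f2", "f3"]
diffs = sliding_window_captions(frames, fake_describe_change)
summary = summarize_clip(diffs, 0, 2, fake_summarize)
print(diffs)
print(summary)
```

Because the summarizer consumes only text, any sub-range of an already captioned video can be summarized again without touching the frames, which is the property the Clip Summarizing role relies on.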

📚 Documentation

Model Details

Model type: ShareCaptioner-Video is an open-source captioner fine-tuned on specific data and based on the InternLM-Xcomposer2-4KHD model.
Model date: ShareCaptioner-Video was trained in May 2024.
Paper or resources for more information: [Project] [Paper] [Code]

Intended Use

Primary intended uses: The primary use of ShareCaptioner-Video is to produce high-quality video captions.
Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Finetuning Dataset

40K GPT4V-generated video-caption pairs
40K differential sliding-window captioning conversations
40K prompt-to-caption textual data
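For the prompt-to-caption data, the model card states that the generated prompt serves as the input and the dense caption as the training target. The sketch below shows one way such reversed pairs could be assembled. The `TrainingExample` class, the `build_recaption_examples` function, and the sample strings are all hypothetical; the actual data format used by the project is not described here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class TrainingExample:
    model_input: str  # short, user-style prompt
    target: str       # dense caption the model should produce


def build_recaption_examples(
    generated: Iterable[Tuple[str, str]]
) -> List[TrainingExample]:
    """Reverse (dense_caption, generated_prompt) pairs into prompt -> caption examples.

    Each pair is assumed to hold a dense caption and the short prompt that was
    generated from it; training runs in the opposite direction.
    """
    return [
        TrainingExample(model_input=prompt, target=caption)
        for caption, prompt in generated
    ]


# Made-up sample pair for illustration only.
pairs = [
    (
        "A brown dog runs across a sunlit beach, kicking up sand as waves roll in.",
        "a dog running on a beach",
    ),
]
examples = build_recaption_examples(pairs)
print(examples[0].model_input)
```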

Paper

arxiv.org/abs/2406.04325

Property	Details
Model Type	ShareCaptioner-Video is an open-source captioner fine-tuned on GPT4V-assisted ShareGPT4Video detailed caption data, supporting videos of various durations, aspect ratios, and resolutions. It is based on the InternLM-Xcomposer2-4KHD model.
Training Data	40K GPT4V-generated video-caption pairs, 40K differential sliding-window captioning conversations, 40K prompt-to-caption textual data
Model Date	May 2024
Paper or Resources	[Project] [Paper] [Code]
Intended Use	Producing high-quality video captions
Intended Users	Researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence
Paper	arxiv.org/abs/2406.04325
