AuroraCap-7B-VID (Xtuner)

Developed by wchai
AuroraCap is a multimodal large language model for image and video captioning, focusing on efficient and detailed video caption generation.
Release Time : 9/24/2024

Model Overview

AuroraCap is a multimodal large language model built on Vicuna-7B, designed for generating detailed video captions. It supports video captioning and video question answering tasks and has performed strongly across multiple benchmarks.

Model Features

Efficient video caption generation
AuroraCap uses token merging to reduce the number of visual tokens passed to the language model, speeding up training and inference while preserving caption quality.
Multi-task support
Supports various tasks such as video detailed captioning, video captioning, and video question answering, adapting to different application scenarios.
Multi-format weight support
Provides weights in official LLaVA format and Xtuner format, facilitating continued training and rapid deployment.
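The token merging mentioned above can be illustrated with a minimal sketch: repeatedly average the most similar pair of token embeddings until the desired reduction is reached. This is a simplified, greedy illustration of the general idea, not AuroraCap's actual implementation (which operates inside the vision encoder with a tuned merging ratio).

```python
import numpy as np

def merge_tokens(tokens, r):
    """Greedy token-merging sketch: perform r merges, each time
    averaging the pair of token embeddings with the highest cosine
    similarity. Illustrative only; not AuroraCap's real algorithm."""
    tokens = [np.asarray(t, dtype=float) for t in tokens]
    for _ in range(r):
        best = None  # (similarity, index_i, index_j)
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                a, b = tokens[i], tokens[j]
                sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = (tokens[i] + tokens[j]) / 2  # average the closest pair
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)]
        tokens.append(merged)
    return tokens

# Example: three visual tokens, one merge step collapses the two
# identical tokens, leaving two tokens for the language model.
toks = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
reduced = merge_tokens(toks, 1)
```

Reducing the visual token count this way shrinks the sequence the language model must attend over, which is where the speedup comes from; the merging ratio trades throughput against caption detail.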

Model Capabilities

Video detailed caption generation
Video caption generation
Video question answering
Multimodal processing

Use Cases

Video content analysis
Video caption generation
Generates detailed captions for videos, enhancing the accessibility and understanding of video content.
Achieved a VDC score of 38.21 on the VDC benchmark.
Video question answering
Answers complex questions about video content, applicable in fields such as education and entertainment.
Achieved an accuracy of 61.8 on the ActivityNet dataset.
Multimodal applications
Image and video captioning
Generates detailed descriptive captions for images and videos, suitable for content management and retrieval.
Achieved a CIDEr score of 33.1 on the MSR-VTT dataset.