xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model equipped with an explicit temporal encoder and designed specifically for video understanding.
Developed by Salesforce AI Research, the model builds on the BLIP-3 architecture and incorporates a learnable temporal encoder module that processes 8-frame video inputs.
Model Features
Efficient Video Understanding
Features an explicit temporal encoder designed specifically for understanding video content.
Compact Model
An efficient, compact vision-language model suitable for resource-constrained environments.
Multi-frame Processing Capability
Trained on 8-frame video inputs; in principle, it can accept any number of frames.
Model Capabilities
Video Content Understanding
Multi-frame Video Processing
Vision-Language Tasks
Use Cases
Video Analysis
Video Question Answering
Performing video question answering tasks on the MSVD-QA dataset.
Achieves a strong trade-off between the number of visual tokens and accuracy.
xGen-MM-Vid (BLIP-3-Video)
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model (VLM) developed by Salesforce AI Research. It features an explicit temporal encoder and is designed specifically to understand videos.
Quick Start
To use our model, please refer to our inference script; this codebase is based on xGen-MM.
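For orientation, the following is a minimal loading sketch using the Hugging Face transformers remote-code interface. The repository id, the prompt format, and the exact generate() arguments are assumptions on our part (modeled on how xGen-MM checkpoints are commonly loaded); the released inference script remains the authoritative reference.

```python
# Illustrative sketch only -- the official entry points live in the released
# inference script; the repo id, prompt format, and generate() arguments
# below are assumptions, not the documented API.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

model_id = "Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-128tokens-8frames"  # assumed repo id

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# 8 frames sampled from the video (see the frame-sampling sketch further below).
frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]
pixel_values = image_processor(frames, return_tensors="pt")["pixel_values"]

prompt = "<image> What is happening in this video?"  # assumed placeholder token
text_inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        pixel_values=pixel_values,
        input_ids=text_inputs["input_ids"],
        max_new_tokens=64,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```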
Features
Temporal Encoder: Incorporates a learnable temporal encoder module within the original (image-based) BLIP-3 architecture, enabling it to better understand video content.
Flexible Frame Input: In principle, the model can take any number of frames, but it was trained with 8-frame videos. Here, we share the 128-token version trained for 8-frame video inputs. The 32-token version can be found at the BLIP-3-Video 32 token model. (A frame-sampling sketch is shown below.)
The inference script serves as an example of how to use our model.
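As a complement to the inference script, here is a small, self-contained sketch of one way to prepare an 8-frame input: uniformly sampling frames from a video with OpenCV. The sampling strategy and frame count here are our own illustrative choices, aligned with the 8-frame training setup described above.

```python
import cv2
from PIL import Image

def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Uniformly sample `num_frames` frames from a video and return PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices spanning the whole clip.
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    indices = [round(i * step) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to RGB before wrapping in PIL.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example_video.mp4")  # 8 uniformly spaced frames
```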
Technical Details
Tokens vs. accuracy
The figure above shows the visual-tokens-vs-accuracy trade-off of various video models, including xGen-MM-Vid (BLIP-3-Video), on the MSVD-QA dataset.
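To make the token budget concrete, below is a hedged, conceptual sketch of how a learnable temporal encoder can compress per-frame visual tokens into a small, fixed set of video-level tokens, here via Perceiver-style attention pooling with learnable queries. This illustrates the general idea only; it is not the module shipped with BLIP-3-Video, and the dimensions and token counts are placeholders.

```python
import torch
import torch.nn as nn

class AttentionPoolTemporalEncoder(nn.Module):
    """Conceptual sketch: compress (num_frames x tokens_per_frame) visual tokens
    into a fixed number of video-level tokens via cross-attention from learnable
    queries. Not the official BLIP-3-Video temporal encoder."""

    def __init__(self, dim: int = 768, num_video_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * tokens_per_frame, dim)
        batch = frame_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (batch, K, dim)
        pooled, _ = self.attn(queries, frame_tokens, frame_tokens)  # (batch, K, dim)
        return self.norm(pooled)

# Example: 8 frames x 128 tokens per frame are pooled into 32 video-level tokens.
encoder = AttentionPoolTemporalEncoder(dim=768, num_video_tokens=32)
frame_tokens = torch.randn(2, 8 * 128, 768)
print(encoder(frame_tokens).shape)  # torch.Size([2, 32, 768])
```

The number of video-level tokens produced by such an encoder (e.g., 32 or 128) is exactly the quantity traded off against accuracy in the comparison above.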
Examples
Important Note
The main data sources are from the internet, including webpages, video stock sites, and curated datasets released by the research community. The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs. We strongly recommend that users assess safety and fairness before applying it to downstream applications.
License
Our code and weights are released under the CC BY-NC 4.0 license.
Usage Tip
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
Citation
@misc{blip3video-xgenmmvid,
  author = {Michael S. Ryoo and Honglu Zhou and Shrikant Kendre and Can Qin and Le Xue and Manli Shu and Silvio Savarese and Ran Xu and Caiming Xiong and Juan Carlos Niebles},
  title = {xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs},
  year = {2024},
  eprint = {2410.16267},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2410.16267},
}