xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model equipped with an explicit temporal encoder, designed specifically for video understanding.
It integrates a learnable temporal encoder module into the original (image-based) BLIP-3 architecture.
Model Features
Explicit temporal encoder
Equipped with an explicit temporal encoder to better understand video content.
Efficient and compact
The model is designed to be efficient and compact, representing an entire video with as few as 32 visual tokens.
Scalability
In principle, it can handle an arbitrary number of frames; during training, 8-frame videos were used.
Model Capabilities
Video content understanding
Multimodal processing
Temporal reasoning over video frames
Use Cases
Video analysis
Video question answering
Performs video question answering on datasets such as MSVD-QA, achieving a strong trade-off between the number of visual tokens and accuracy.
xGen-MM-Vid (BLIP-3-Video)
xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model (VLM) with an explicit temporal encoder, specifically designed for video understanding. Developed by Salesforce AI Research, it incorporates learnable temporal encoder modules into the original (image-based) BLIP-3 architecture.
Quick Start
Here, we share the 32-token version trained to take 8-frame video inputs. In principle, it can handle any number of frames, but it was trained with 8-frame videos.
For more details, check out our tech report. A more detailed explanation can also be found in the blog article.
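As a starting point, the snippet below sketches how a checkpoint like this is typically loaded through Hugging Face transformers with remote code enabled. It is a minimal sketch only: the model id shown is a placeholder assumption, and the exact preprocessing and generation calls are defined by the model's custom code, so please follow the official inference script for the authoritative usage.

```python
# Minimal, hedged sketch of loading the checkpoint via Hugging Face transformers.
# The model id below is a placeholder assumption; the supported preprocessing and
# generation API is defined by the model's custom code (see the inference script).
import torch
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor

MODEL_ID = "Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-32tokens-8frames"  # placeholder

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Next steps (see the inference script): sample 8 frames from the video,
# preprocess them with `image_processor`, build the prompt with `tokenizer`,
# and call `model.generate(...)`.
```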
Features
An efficient, compact vision-language model with an explicit temporal encoder for video understanding.
Incorporates learnable temporal encoder modules into the BLIP-3 architecture.
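To make the temporal-encoder idea concrete, here is a minimal PyTorch sketch of one possible design: a set of learnable query tokens that attend over per-frame visual tokens and compress them into a fixed number of video-level tokens (32 in this sketch). This is an illustrative attention-pooling variant under our own assumptions, not the exact module shipped in the released model; the tech report describes the temporal encoders that were actually explored.

```python
# Illustrative sketch only: a learnable attention-pooling temporal encoder that
# compresses per-frame visual tokens into a fixed number of video tokens (e.g., 32).
# The released model's temporal encoder may differ -- see the tech report.
import torch
import torch.nn as nn

class TemporalTokenPooler(nn.Module):
    def __init__(self, dim: int = 768, num_video_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        # One learnable query per output video token.
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * tokens_per_frame, dim), i.e. all
        # per-frame visual tokens flattened along the time axis.
        q = self.queries.unsqueeze(0).expand(frame_tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)
        # (batch, num_video_tokens, dim): a compact video representation that
        # can be handed to the language model.
        return self.norm(pooled)

# Example: 8 frames x 128 tokens per frame are pooled into 32 video tokens.
video_tokens = TemporalTokenPooler()(torch.randn(2, 8 * 128, 768))
print(video_tokens.shape)  # torch.Size([2, 32, 768])
```

The design choice this illustrates is the one stated in the paper title: regardless of how many frame tokens come in, the language model only ever sees a small, fixed budget of video tokens.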
Documentation
Results
Tokens vs. accuracy
The above figure shows the trade-off between the number of visual tokens and accuracy of various video models, including xGen-MM-Vid (BLIP-3-Video), on the MSVD-QA dataset.
Examples
How to use
Please check out our inference script for an example of how to use our model. This codebase is based on xGen-MM.
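While the inference script is the authoritative reference, a common preprocessing step for an 8-frame model is to sample frames uniformly across the clip. The sketch below assumes OpenCV and uniform sampling; the actual sampling strategy used by the released pipeline may differ.

```python
# Minimal sketch: uniformly sample N frames from a video with OpenCV.
# Uniform sampling is an assumption here -- check the official inference
# script for the exact preprocessing used by the model.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> list:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert to RGB for vision models.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```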
Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages, video stock sites, and curated datasets released by the research community. The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs. We strongly recommend that users assess safety and fairness before applying the model to downstream applications.
Ethical Considerations
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
License
Our code and weights are released under the CC BY-NC 4.0 license.
Installation
If any packages are missing, please consider installing the following:
Citation

@misc{blip3video-xgenmmvid,
author = {Michael S. Ryoo and Honglu Zhou and Shrikant Kendre and Can Qin and Le Xue and Manli Shu and Silvio Savarese and Ran Xu and Caiming Xiong and Juan Carlos Niebles},
title = {xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs},
year = {2024},
eprint = {2410.16267},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2410.16267},
}