xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model equipped with an explicit temporal encoder and designed specifically for video understanding.
Developed by Salesforce AI Research, the model builds on the BLIP-3 architecture and incorporates a learnable temporal encoder module that processes 8-frame video inputs.
Model Features
Efficient Video Understanding
Features an explicit temporal encoder designed specifically for understanding video content.
Compact Model
An efficient, compact vision-language model suitable for resource-constrained environments.
Multi-frame Processing Capability
Trained on 8-frame video inputs; in principle, it can accept any number of frames.
Model Capabilities
Video Content Understanding
Multi-frame Video Processing
Vision-Language Tasks
Use Cases
Video Analysis
Video Question Answering
Performing video question answering tasks on the MSVD-QA dataset.
Achieves a strong trade-off between the number of visual tokens and accuracy.
xGen-MM-Vid (BLIP-3-Video)
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model (VLM) developed by Salesforce AI Research. It features an explicit temporal encoder and is designed specifically to understand videos.
Quick Start
To use our model, please refer to our inference script; this codebase is based on xGen-MM.
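For orientation, the following is a minimal loading sketch using the Hugging Face transformers remote-code interface. The repository id, the prompt format, and the exact generate() arguments are assumptions on our part (modeled on how xGen-MM checkpoints are commonly loaded); the released inference script remains the authoritative reference.

```python
# Illustrative sketch only -- the official entry points live in the released
# inference script; the repo id, prompt format, and generate() arguments
# below are assumptions, not the documented API.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

model_id = "Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-128tokens-8frames"  # assumed repo id

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# 8 frames sampled from the video (see the frame-sampling sketch further below).
frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]
pixel_values = image_processor(frames, return_tensors="pt")["pixel_values"]

prompt = "<image> What is happening in this video?"  # assumed placeholder token
text_inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        pixel_values=pixel_values,
        input_ids=text_inputs["input_ids"],
        max_new_tokens=64,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```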
Features
Temporal Encoder: Incorporates a learnable temporal encoder module within the original (image-based) BLIP-3 architecture, enabling it to better understand video content.
Flexible Frame Input: In principle, the model can take any number of frames, but it was trained with 8-frame videos. Here, we share the 128-token version trained for 8-frame video inputs. The 32-token version can be found at the BLIP-3-Video 32 token model. (A frame-sampling sketch is shown below.)
The inference script serves as an example of how to use our model.
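As a complement to the inference script, here is a small, self-contained sketch of one way to prepare an 8-frame input: uniformly sampling frames from a video with OpenCV. The sampling strategy and frame count here are our own illustrative choices, aligned with the 8-frame training setup described above.

```python
import cv2
from PIL import Image

def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Uniformly sample `num_frames` frames from a video and return PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices spanning the whole clip.
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    indices = [round(i * step) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to RGB before wrapping in PIL.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example_video.mp4")  # 8 uniformly spaced frames
```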
Technical Details
Tokens vs. accuracy
The figure above shows the visual-tokens-vs-accuracy trade-off of various video models, including xGen-MM-Vid (BLIP-3-Video), on the MSVD-QA dataset.
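To make the token budget concrete, below is a hedged, conceptual sketch of how a learnable temporal encoder can compress per-frame visual tokens into a small, fixed set of video-level tokens, here via Perceiver-style attention pooling with learnable queries. This illustrates the general idea only; it is not the module shipped with BLIP-3-Video, and the dimensions and token counts are placeholders.

```python
import torch
import torch.nn as nn

class AttentionPoolTemporalEncoder(nn.Module):
    """Conceptual sketch: compress (num_frames x tokens_per_frame) visual tokens
    into a fixed number of video-level tokens via cross-attention from learnable
    queries. Not the official BLIP-3-Video temporal encoder."""

    def __init__(self, dim: int = 768, num_video_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * tokens_per_frame, dim)
        batch = frame_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (batch, K, dim)
        pooled, _ = self.attn(queries, frame_tokens, frame_tokens)  # (batch, K, dim)
        return self.norm(pooled)

# Example: 8 frames x 128 tokens per frame are pooled into 32 video-level tokens.
encoder = AttentionPoolTemporalEncoder(dim=768, num_video_tokens=32)
frame_tokens = torch.randn(2, 8 * 128, 768)
print(encoder(frame_tokens).shape)  # torch.Size([2, 32, 768])
```

The number of video-level tokens produced by such an encoder (e.g., 32 or 128) is exactly the quantity traded off against accuracy in the comparison above.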
Examples
Important Note
The main data sources are from the internet, including webpages, video stock sites, and curated datasets released by the research community. The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs. We strongly recommend that users assess safety and fairness before applying it to downstream applications.
License
Our code and weights are released under the CC BY-NC 4.0 license.
Usage Tip
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
Citation
@misc{blip3video-xgenmmvid,
  author = {Michael S. Ryoo and Honglu Zhou and Shrikant Kendre and Can Qin and Le Xue and Manli Shu and Silvio Savarese and Ran Xu and Caiming Xiong and Juan Carlos Niebles},
  title = {xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs},
  year = {2024},
  eprint = {2410.16267},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2410.16267},
}