xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model equipped with an explicit temporal encoder, designed specifically for video understanding.
It integrates a learnable temporal encoder module into the original (image-based) BLIP-3 architecture.
Model Features
Explicit temporal encoder
Equipped with an explicit temporal encoder to better understand video content.
Efficient and compact
The model is designed to be efficient and compact, representing an entire video with as few as 32 visual tokens.
Scalability
In principle, it can handle an arbitrary number of frames; during training, 8-frame videos were used.
Model Capabilities
Video content understanding
Multimodal processing
Temporal reasoning over video frames
Use Cases
Video analysis
Video question answering
Performs video question answering on datasets such as MSVD-QA, achieving a strong trade-off between the number of visual tokens and accuracy.
xGen-MM-Vid (BLIP-3-Video)
xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model (VLM) with an explicit temporal encoder, specifically designed for video understanding. Developed by Salesforce AI Research, it incorporates learnable temporal encoder modules into the original (image-based) BLIP-3 architecture.
Quick Start
Here, we share the 32-token version trained to take 8-frame video inputs. In principle, it can handle any number of frames, but it was trained with 8-frame videos.
For more details, check out our tech report. A more detailed explanation can also be found in the blog article.
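As a starting point, the snippet below sketches how a checkpoint like this is typically loaded through Hugging Face transformers with remote code enabled. It is a minimal sketch only: the model id shown is a placeholder assumption, and the exact preprocessing and generation calls are defined by the model's custom code, so please follow the official inference script for the authoritative usage.

```python
# Minimal, hedged sketch of loading the checkpoint via Hugging Face transformers.
# The model id below is a placeholder assumption; the supported preprocessing and
# generation API is defined by the model's custom code (see the inference script).
import torch
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor

MODEL_ID = "Salesforce/xgen-mm-vid-phi3-mini-r-v1.5-32tokens-8frames"  # placeholder

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Next steps (see the inference script): sample 8 frames from the video,
# preprocess them with `image_processor`, build the prompt with `tokenizer`,
# and call `model.generate(...)`.
```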
Features
An efficient, compact vision-language model with an explicit temporal encoder for video understanding.
Incorporates learnable temporal encoder modules into the BLIP-3 architecture.
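To make the temporal-encoder idea concrete, here is a minimal PyTorch sketch of one possible design: a set of learnable query tokens that attend over per-frame visual tokens and compress them into a fixed number of video-level tokens (32 in this sketch). This is an illustrative attention-pooling variant under our own assumptions, not the exact module shipped in the released model; the tech report describes the temporal encoders that were actually explored.

```python
# Illustrative sketch only: a learnable attention-pooling temporal encoder that
# compresses per-frame visual tokens into a fixed number of video tokens (e.g., 32).
# The released model's temporal encoder may differ -- see the tech report.
import torch
import torch.nn as nn

class TemporalTokenPooler(nn.Module):
    def __init__(self, dim: int = 768, num_video_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        # One learnable query per output video token.
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * tokens_per_frame, dim), i.e. all
        # per-frame visual tokens flattened along the time axis.
        q = self.queries.unsqueeze(0).expand(frame_tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)
        # (batch, num_video_tokens, dim): a compact video representation that
        # can be handed to the language model.
        return self.norm(pooled)

# Example: 8 frames x 128 tokens per frame are pooled into 32 video tokens.
video_tokens = TemporalTokenPooler()(torch.randn(2, 8 * 128, 768))
print(video_tokens.shape)  # torch.Size([2, 32, 768])
```

The design choice this illustrates is the one stated in the paper title: regardless of how many frame tokens come in, the language model only ever sees a small, fixed budget of video tokens.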
Documentation
Results
Tokens vs. accuracy
The above figure shows the trade-off between the number of visual tokens and accuracy of various video models, including xGen-MM-Vid (BLIP-3-Video), on the MSVD-QA dataset.
Examples
How to use
Please check out our inference script for an example of how to use our model. This codebase is based on xGen-MM.
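While the inference script is the authoritative reference, a common preprocessing step for an 8-frame model is to sample frames uniformly across the clip. The sketch below assumes OpenCV and uniform sampling; the actual sampling strategy used by the released pipeline may differ.

```python
# Minimal sketch: uniformly sample N frames from a video with OpenCV.
# Uniform sampling is an assumption here -- check the official inference
# script for the exact preprocessing used by the model.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> list:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert to RGB for vision models.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```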
Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages, video stock sites, and curated datasets released by the research community. The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs. We strongly recommend that users assess safety and fairness before applying the model to downstream applications.
Ethical Considerations
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
License
Our code and weights are released under the CC BY-NC 4.0 license.
Installation
If any packages are missing, please consider installing the following:
Citation

@misc{blip3video-xgenmmvid,
author = {Michael S. Ryoo and Honglu Zhou and Shrikant Kendre and Can Qin and Le Xue and Manli Shu and Silvio Savarese and Ran Xu and Caiming Xiong and Juan Carlos Niebles},
title = {xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs},
year = {2024},
eprint = {2410.16267},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2410.16267},
}