xGen-MM-Vid Phi3 Mini R V1.5 32tokens 8frames

Developed by Salesforce
xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model equipped with an explicit temporal encoder, specifically designed to understand video content.
Downloads 441
Release Time: 1/15/2025

Model Overview

This model integrates a learnable temporal encoder module into the original BLIP-3 architecture, enhancing the ability to understand video content.

Model Features

Explicit temporal encoder
Equipped with an explicit temporal encoder to better understand video content.
Efficient and compact
The model is designed to be efficient and compact, suitable for processing video content.
Scalability
In principle, it can handle an arbitrary number of frames. During training, 8-frame videos are used.
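Since training used 8-frame videos, a longer clip would typically be reduced to 8 frames before being passed to the model. The sketch below shows one common way to do that, uniform sampling from the centre of equal-length segments. The sampling strategy and the helper name `sample_frame_indices` are assumptions made for illustration; this listing does not state how frames are actually selected.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Pick frame indices spread evenly over a clip.

    Assumption: uniform sampling. The model card only says that
    8-frame videos were used during training, not how they were chosen.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")

    # Short clip: keep every frame rather than repeating any.
    if total_frames <= num_samples:
        return list(range(total_frames))

    # Split the clip into num_samples equal segments and take the
    # frame at the centre of each segment.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # An 80-frame clip reduced to 8 evenly spaced frames.
    print(sample_frame_indices(80))
```

Passing a different `num_samples` gives a different frame count, in line with the note that the model can in principle handle an arbitrary number of frames.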

Model Capabilities

Video content understanding
Multimodal processing
Time series analysis

Use Cases

Video analysis
Video question answering
Performs video question answering on the MSVD-QA dataset.
On this task, the model shows a good trade-off between the number of visual tokens and accuracy.