xgen-mm-vid-phi3-mini-r-v1.5-128tokens-8frames

Developed by Salesforce
xGen-MM-Vid (BLIP-3-Video) is an efficient compact vision-language model equipped with an explicit temporal encoder, specifically designed for video content understanding.
Downloads 398
Release Time: 12/18/2024

Model Overview

Developed by Salesforce AI Research, this model is based on the BLIP-3 architecture and incorporates a learnable temporal encoder module capable of processing 8-frame video inputs.

Model Features

Efficient Video Understanding
Equipped with an explicit temporal encoder, specifically designed for video content understanding.
Compact Model
An efficient compact vision-language model suitable for resource-constrained environments.
Multi-frame Processing Capability
Processes 8-frame video inputs; in principle, the architecture can accept any number of frames.
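
Because the model takes a fixed 8-frame input, a caller usually has to reduce a longer clip to 8 frames first. The following is a minimal sketch of one common approach, uniform sampling, and is not taken from the model's own preprocessing code, which may differ. The function name sample_frame_indices and its parameters are hypothetical and used here for illustration only.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick num_frames indices spread evenly across a clip of total_frames frames.

    Illustrative helper only (an assumption, not the model's official pipeline).
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if total_frames <= num_frames:
        # Short clip: use every frame that exists.
        return list(range(total_frames))
    # Split the clip into num_frames equal segments and take each midpoint.
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


if __name__ == "__main__":
    # An 80-frame clip is reduced to 8 evenly spaced frame indices.
    print(sample_frame_indices(80))
```

Taking segment midpoints rather than segment starts avoids always selecting the very first frame, which is often a fade-in or title card; other strategies (random sampling per segment, keyframe detection) are equally plausible.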

Model Capabilities

Video Content Understanding
Multi-frame Video Processing
Vision-Language Tasks

Use Cases

Video Analysis
Video Question Answering
Answers questions about video content, as evaluated on the MSVD-QA dataset.
Offers a favorable trade-off between the number of visual tokens and accuracy.
© 2025 AIbase