TEMPURA-Qwen2.5-VL-3B-s2 Open-source Vision Language Model - Capable of Inferring Causal Relationships and Generating Video Timestamp Descriptions

TEMPURA Qwen2.5 VL 3B S2

Developed by andaba

TEMPURA is a vision-language model capable of reasoning causal event relationships and generating fine-grained timestamp descriptions for unedited videos.

Video-to-Text

Transformers

#Video Causal Reasoning #Fine-grained Temporal Segmentation #Event Temporal Understanding

Downloads 102

Release Time : 5/3/2025

Model Overview

By integrating causal reasoning with fine-grained temporal segmentation, this model enhances the understanding of video temporal sequences, making it suitable for temporal localization and highlight detection in videos.

Model Features

Causal Event Relationship Reasoning

Capable of understanding causal relationships between events in videos, enhancing temporal comprehension.

Fine-grained Temporal Segmentation

Generates fine-grained timestamp descriptions for videos, enabling precise temporal localization.

Multi-task Processing

Simultaneously handles masked event prediction and video event segmentation tasks.

Model Capabilities

Video Temporal Localization

Video Highlight Detection

Video Event Causal Reasoning

Video Temporal Understanding

Generating Timestamp Descriptions

Use Cases

Video Analysis

Video Summarization

Automatically generates summaries of key events in videos.

Event Extraction

Extracts important events and their temporal information from videos.

Intelligent Q&A

Video Question Answering System

Answers questions about video temporal sequences and event relationships.

🚀 TEMPURA Vision-Language Model

TEMPURA is a vision-language model designed to reason about causal event relationships and generate fine - grained, timestamped descriptions of untrimmed videos.

🚀 Quick Start

Inference: Check the inference example.
Training: Check the model training script.

✨ Features

TEMPURA enhances video temporal understanding by integrating causal reasoning with fine - grained temporal segmentation.

📚 Documentation

Model Details

Model Description

TEMPURA enhances video temporal understanding by integrating causal reasoning with fine - grained temporal segmentation. More details can be found on the project page.

Developed by: Jen - Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi - Hao Peng, Hou - I Liu, Hsiang - Wei Huang, Kuang - Ming Chen, Cheng - Yen Yang, Wenhao Chai, Yi - Ling Chen, Vibhav Vineet, Qin Cai, Jenq - Neng Hwang
Model type: Video - Language Model
Language(s) (NLP): English
License: cc - by - 4.0
Finetuned from model: Qwen/Qwen2.5 - VL - 3B - Instruct

Model Sources

Repository: https://github.com/andy-cheng/TEMPURA
Paper: TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
Project Page: https://andy-cheng.github.io/TEMPURA/

Uses

Direct Use

The model can be used directly for temporal grounding and highlight detection in videos.

Downstream Use

The model can be fine - tuned for various applications requiring temporal video understanding, such as video summarization, event extraction, and question answering.

Out - of - Scope Use

The model may not perform well on videos with significantly different visual styles or languages compared to the training data.

Bias, Risks, and Limitations

The model's performance is influenced by biases present in the VER dataset. Further analysis is needed to fully characterize these biases.

Recommendations

Users should be aware of potential biases in the model's outputs.

Training Details

Training Data

The model was trained on the VER dataset (https://huggingface.co/datasets/andaba/TEMPURA-VER).

Training Procedure

The training procedure involves masked event prediction and video event segmentation with temporal dense captioning. See the training scripts in the repository for details.

Training Hyperparameters

Training regime: [More Information Needed]

Evaluation

The evaluation section currently lacks information about testing data, factors, metrics, and results.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: [More Information Needed]
Hours used: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications

This section currently lacks information about model architecture, objective, compute infrastructure (hardware and software).

Citation

BibTeX:

@article{tempura,
       title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action}, 
       author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu
              and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang
              and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
       journal={arXiv preprint arXiv:2505.01583},
       year={2025}
}

APA: Cheng, J. - H., Wang, V., Wang, H., Zhou, H., Peng, Y. - H., Liu, H. - I., Huang, H. - W., Chen, K. - M., Yang, C. - Y., Chai, W., Chen, Y. - L., Vineet, V., Cai, Q., & Hwang, J. - N. (2025). TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action. arXiv preprint arXiv:2505.01583.

Model Card Contact

Jen - Hao Cheng, andyhci@uw.edu

Information Table

Property	Details
Model Type	Video - Language Model
Training Data	The model was trained on the VER dataset (https://huggingface.co/datasets/andaba/TEMPURA-VER)
License	cc - by - 4.0
Finetuned from model	Qwen/Qwen2.5 - VL - 3B - Instruct

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご