VideoChat-Flash-Qwen2_5-7B-1M_res224⚡
VideoChat-Flash-Qwen2_5-7B-1M_res224 is built on UMT-L (300M) and Qwen2.5-7B-1M, and uses only 16 tokens per frame. By leveraging YaRN to extend the context window to 1M tokens (the native context window of Qwen2.5-7B-1M is 128k), the model supports input sequences of up to approximately 50,000 frames.
[Blog] [GitHub] [Tech Report] [Chat Demo]
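As a rough illustration of why 16 tokens per frame matters, the back-of-the-envelope arithmetic below (an illustrative sketch using the numbers from the description above, not an official sizing guide) shows how much of the 1M-token context window ~50,000 frames would occupy:

```python
# Back-of-the-envelope token budget (illustrative only; numbers from the model card).
tokens_per_frame = 16          # visual tokens per frame after compression
context_window = 1_000_000     # context length extended via YaRN
max_frames = 50_000            # approximate frame limit stated above

visual_tokens = max_frames * tokens_per_frame          # 800,000 visual tokens
remaining_for_text = context_window - visual_tokens    # ~200,000 tokens left for text
print(visual_tokens, remaining_for_text)
```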
⚠️ Important Note
Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension. For optimal performance, interacting in English is recommended.
Quick Start
Installation
First, install the required packages (FlashAttention-2 is optional but recommended). A simple installation example is provided below:
```bash
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
```
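As a quick sanity check (an optional sketch, not part of the official instructions), the snippet below verifies that the packages installed above can be found and reports whether flash-attn is available; the model still runs without it, just more slowly:

```python
import importlib.util

# Packages installed above; flash_attn is optional.
for pkg in ["transformers", "av", "imageio", "decord", "cv2"]:
    assert importlib.util.find_spec(pkg) is not None, f"missing package: {pkg}"

has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print("flash-attn available:", has_flash_attn)
```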
Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224'

# Load the tokenizer and the model (bfloat16 on GPU).
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

# Optional visual-token compression inside the LLM: drop a fraction of image tokens
# at the listed layers to trade a little accuracy for speed and memory.
mm_llm_compress = False
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# Maximum number of frames sampled from the video.
max_num_frames = 512

# Greedy decoding (sampling disabled).
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# First turn: ask about the video and keep the chat history.
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output1)

# Second turn: a follow-up question that reuses the chat history.
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output2)
```
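To ask several questions about the same video while threading the conversation, a small loop over model.chat can be used. The sketch below relies only on the chat signature shown above; the helper name ask_all is ours, and it assumes chat_history=None is acceptable on the first turn (as suggested by its omission in the first call above):

```python
def ask_all(model, tokenizer, video_path, questions, max_num_frames=512, generation_config=None):
    """Hypothetical helper: ask a list of questions about one video, reusing chat history."""
    history, answers = None, []
    for q in questions:
        answer, history = model.chat(
            video_path=video_path,
            tokenizer=tokenizer,
            user_prompt=q,
            chat_history=history,
            return_history=True,
            max_num_frames=max_num_frames,
            generation_config=generation_config,
        )
        answers.append(answer)
    return answers

# Example usage:
# answers = ask_all(model, tokenizer, "your_video.mp4",
#                   ["Describe this video in detail.", "How many people appear in the video?"],
#                   generation_config=generation_config)
```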
⨠Features
Performance
Benchmark accuracies for this checkpoint are listed in the Results table under Documentation below.
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Video-text-to-text |
| Training Data | Not specified |
| Metrics | Accuracy |
| Tags | Multimodal |
Results
- Model Name: VideoChat-Flash-Qwen2_5-7B-1M_res224
| Dataset | Accuracy (%) |
|---------|----------|
| MLVU | 74.1 |
| MVBench | 73.4 |
| Perception Test | 75.4 |
| LongVideoBench | 66.5 |
| VideoMME (wo sub) | 63.5 |
| LVBench | 46.0 |
Citation
```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```
License
This project is licensed under the Apache-2.0 license.