# VideoChat-Flash-Qwen2_5-2B_res448
VideoChat-Flash-2B is built on UMT-L (300M) and Qwen2.5-1.5B and uses only 16 tokens per frame. By extending the context window to 128k with YaRN (Qwen2.5's native context window is 32k), it supports input sequences of up to roughly 10,000 frames. The model is suited to multimodal tasks, especially video-text-to-text processing.
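For context, Qwen2.5 models are typically extended from their native 32k window to 128k by adding a YaRN `rope_scaling` entry to the model's `config.json`. The sketch below shows Qwen2.5's published recipe; whether this model ships exactly these values is an assumption, so check its own config before relying on it.

```python
# Sketch of Qwen2.5's published YaRN recipe for 32k -> 128k context
# (assumed here for illustration; this model's shipped config may differ).
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                              # 32768 * 4 = 131072 (~128k)
    "original_max_position_embeddings": 32768,  # Qwen2.5's native window
}
```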
## Important Note
Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension. For optimal performance, interacting in English is recommended.
## Quick Start

### Installation

First, install the required packages; Flash Attention 2 is optional but recommended for speed. A simple installation example is provided below:
```shell
pip install transformers==4.40.1
pip install timm
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional, for Flash Attention 2
pip install flash-attn --no-build-isolation
```
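The check below is not part of the original instructions, but it is a quick way to confirm whether Flash Attention 2 is importable before loading the model:

```python
# Quick sanity check: is flash-attn available in this environment?
try:
    import flash_attn  # noqa: F401
    print("flash-attn found: Flash Attention 2 can be used.")
except ImportError:
    print("flash-attn missing: expect a slower attention fallback.")
```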
### Usage Examples

#### Basic Usage
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

# Optional: compress visual tokens further inside the LLM.
mm_llm_compress = False
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]              # layers where compression kicks in
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]   # kept-token ratio per stage
else:
    model.config.mm_llm_compress = False

max_num_frames = 512

generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1,
)

video_path = "your_video.mp4"

# First turn: open-ended description.
question1 = "Describe this video in detail."
output1, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question1,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output1)

# Second turn: follow-up question reusing the chat history.
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question2,
    chat_history=chat_history,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config,
)
print(output2)
```
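If `flash-attn` is not installed, one possible fallback is PyTorch's scaled-dot-product attention. This is a sketch under an assumption: standard `transformers` models accept the `attn_implementation` keyword, but whether this card's `trust_remote_code` modeling code honors it is not stated in the original.

```python
# Assumed fallback: request PyTorch SDPA instead of Flash Attention 2.
# Works for standard transformers models; unverified for this remote code.
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    attn_implementation="sdpa",
).to(torch.bfloat16).cuda()
```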
## Features

- Efficient Token Utilization: uses only 16 tokens per frame, keeping the visual-token budget small (see the arithmetic sketch after this list).
- Extended Context Window: supports input sequences of up to about 10,000 frames by extending the context window to 128k.
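Rough arithmetic (mine, not from the card) showing how these two numbers interact: at 16 tokens per frame, the 512-frame default above costs 8,192 visual tokens, and a 128k window alone accommodates about 8,000 frames; the ~10,000-frame figure presumably also leans on the LLM-side compression shown in the usage example.

```python
# Illustrative back-of-envelope, not from the model card.
TOKENS_PER_FRAME = 16
CONTEXT_WINDOW = 128_000  # approximate size after YaRN extension

def visual_token_count(num_frames: int) -> int:
    """Visual tokens contributed by a clip sampled at num_frames."""
    return num_frames * TOKENS_PER_FRAME

print(visual_token_count(512))             # 8192, the usage example's default budget
print(CONTEXT_WINDOW // TOKENS_PER_FRAME)  # ~8000 frames fit with no extra compression
```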
## Performance

Reported metrics include accuracy on MVBench, LongVideoBench, and VideoMME (w/o subtitles); see Model Details below.

## Documentation

## License
This project is licensed under the Apache-2.0 license.
## Citation
```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```
## Model Details

| Property | Details |
|---|---|
| Model Type | VideoChat-Flash-Qwen2_5-2B_res448 |
| Training Data | Not provided |
| Metrics | Accuracy on MVBench, LongVideoBench, VideoMME (w/o subtitles), etc. |