VideoChat-Flash-Qwen2_5-7B-1M_res224⚡
VideoChat-Flash-Qwen2_5-7B-1M_res224 is built on UMT-L (300M) and Qwen2.5-7B-1M, and uses only 16 tokens per frame. By leveraging YaRN to extend the context window to 1M tokens (the native context window of Qwen2.5-7B-1M is 128k), the model supports input sequences of up to approximately 50,000 frames.
[Blog] [GitHub] [Tech Report] [Chat Demo]
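As a rough illustration of why 16 tokens per frame matters, the back-of-the-envelope arithmetic below (an illustrative sketch using the numbers from the description above, not an official sizing guide) shows how much of the 1M-token context window ~50,000 frames would occupy:

```python
# Back-of-the-envelope token budget (illustrative only; numbers from the model card).
tokens_per_frame = 16          # visual tokens per frame after compression
context_window = 1_000_000     # context length extended via YaRN
max_frames = 50_000            # approximate frame limit stated above

visual_tokens = max_frames * tokens_per_frame          # 800,000 visual tokens
remaining_for_text = context_window - visual_tokens    # ~200,000 tokens left for text
print(visual_tokens, remaining_for_text)
```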
⚠️ Important Note
Because the training corpus is predominantly English, the model exhibits only basic Chinese comprehension. For optimal performance, interacting in English is recommended.
Quick Start
Installation
First, install the required packages (FlashAttention-2 is optional but recommended). A simple installation example is provided below:
```bash
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
```
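As a quick sanity check (an optional sketch, not part of the official instructions), the snippet below verifies that the packages installed above can be found and reports whether flash-attn is available; the model still runs without it, just more slowly:

```python
import importlib.util

# Packages installed above; flash_attn is optional.
for pkg in ["transformers", "av", "imageio", "decord", "cv2"]:
    assert importlib.util.find_spec(pkg) is not None, f"missing package: {pkg}"

has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print("flash-attn available:", has_flash_attn)
```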
Usage Examples
Basic Usage
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224'

# Load the tokenizer and the model (bfloat16 on GPU).
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

# Optional visual-token compression inside the LLM: drop a fraction of image tokens
# at the listed layers to trade a little accuracy for speed and memory.
mm_llm_compress = False
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# Maximum number of frames sampled from the video.
max_num_frames = 512

# Greedy decoding (sampling disabled).
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# First turn: ask about the video and keep the chat history.
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output1)

# Second turn: a follow-up question that reuses the chat history.
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output2)
```
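To ask several questions about the same video while threading the conversation, a small loop over model.chat can be used. The sketch below relies only on the chat signature shown above; the helper name ask_all is ours, and it assumes chat_history=None is acceptable on the first turn (as suggested by its omission in the first call above):

```python
def ask_all(model, tokenizer, video_path, questions, max_num_frames=512, generation_config=None):
    """Hypothetical helper: ask a list of questions about one video, reusing chat history."""
    history, answers = None, []
    for q in questions:
        answer, history = model.chat(
            video_path=video_path,
            tokenizer=tokenizer,
            user_prompt=q,
            chat_history=history,
            return_history=True,
            max_num_frames=max_num_frames,
            generation_config=generation_config,
        )
        answers.append(answer)
    return answers

# Example usage:
# answers = ask_all(model, tokenizer, "your_video.mp4",
#                   ["Describe this video in detail.", "How many people appear in the video?"],
#                   generation_config=generation_config)
```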
⨠Features
Performance
Benchmark accuracies for this checkpoint are listed in the Results table under Documentation below.
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Video-text-to-text |
| Training Data | Not specified |
| Metrics | Accuracy |
| Tags | Multimodal |
Results
- Model Name: VideoChat-Flash-Qwen2_5-7B-1M_res224
| Dataset | Accuracy (%) |
|---------|----------|
| MLVU | 74.1 |
| MVBench | 73.4 |
| Perception Test | 75.4 |
| LongVideoBench | 66.5 |
| VideoMME (wo sub) | 63.5 |
| LVBench | 46.0 |
Citation
```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```
License
This project is licensed under the Apache-2.0 license.