🚀 🦜VideoChat-Flash-Qwen2-7B_res448⚡
VideoChat-Flash-Qwen2-7B_res448 is built on UMT-L (300M) and Qwen2-7B, and uses only 16 tokens per frame. By leveraging YaRN to extend the context window to 128k (Qwen2's native context window is 32k), the model supports input sequences of up to approximately 10,000 frames.
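As a rough sanity check on those numbers (our own back-of-the-envelope arithmetic, not an official figure), 16 tokens per frame against a 128k context leaves room for roughly 8,000 uncompressed frames; getting toward ~10,000 frames presumably also relies on the in-LLM token compression exposed via the `mm_llm_compress` option shown below:

```python
# Back-of-the-envelope frame budget (illustrative only; constants are assumptions).
TOKENS_PER_FRAME = 16          # visual tokens per frame
CONTEXT_WINDOW = 128 * 1024    # 128k context after YaRN extension

text_budget = 2048             # assumed reserve for prompt + response tokens
frame_budget = (CONTEXT_WINDOW - text_budget) // TOKENS_PER_FRAME
print(frame_budget)            # ≈ 8064 frames before any in-LLM token compression
```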
⚠️ Important Note
Because the training corpus is primarily English, the model has only basic Chinese comprehension. For best results, we recommend interacting in English.
🚀 Quick Start
Install Dependencies
First, you need to install flash attention 2 and a few other modules. A simple installation example:
```shell
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
```
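After installation, a quick import check (a minimal sketch we added; only the optional flash-attn import is wrapped) can confirm the environment:

```python
# Minimal sanity check that the dependencies above import correctly.
import transformers, av, imageio, decord, cv2

print("transformers:", transformers.__version__)

try:
    import flash_attn  # optional, only needed for flash attention 2
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")
```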
Use the Model
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_path = 'OpenGVLab/VideoChat-Flash-Qwen2-7B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

# Optionally compress visual tokens inside the LLM (useful for long videos).
mm_llm_compress = False
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]              # LLM layers where compression is applied
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]   # fraction of image tokens kept at each stage
else:
    model.config.mm_llm_compress = False

max_num_frames = 512

generation_config = dict(
    do_sample=False,      # greedy decoding; temperature/top_p then have no effect
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1,
)

video_path = "your_video.mp4"

# First turn: open-ended description.
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output1)

# Second turn: follow-up question that reuses the chat history.
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output2)
```
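For very long videos, one option is to combine the `mm_llm_compress` toggle from above with a larger `max_num_frames`. The sketch below reuses the exact `chat` signature shown earlier; the specific values and file name are illustrative:

```python
# Illustrative: enable in-LLM token compression before processing a long video.
model.config.mm_llm_compress = True
model.config.llm_compress_type = "uniform0_attention"
model.config.llm_compress_layer_list = [4, 18]
model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]

output3, _ = model.chat(
    video_path="your_long_video.mp4",   # hypothetical path
    tokenizer=tokenizer,
    user_prompt="Summarize the main events in chronological order.",
    return_history=True,
    max_num_frames=1024,                # illustrative; bounded by GPU memory
    generation_config=generation_config,
)
print(output3)
```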
✨ Key Features
- Efficient token usage: only 16 tokens per frame, improving processing efficiency.
- Long-context support: YaRN extends the context window to 128k, supporting input sequences of up to approximately 10,000 frames.
📈 Performance
📚 Detailed Documentation
Model Metrics
Evaluation results across datasets:
| Task Type  | Dataset                  | Accuracy |
|------------|--------------------------|----------|
| Multimodal | MLVU                     | 74.7     |
| Multimodal | MVBench                  | 74.0     |
| Multimodal | Perception Test          | 76.2     |
| Multimodal | LongVideoBench           | 64.7     |
| Multimodal | VideoMME (w/o subtitles) | 65.3     |
| Multimodal | LVBench                  | 48.2     |
✏️ Citation
```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```
📄 License
This project is released under the Apache-2.0 license.