VideoChat-Flash-Qwen2-7B_res448: an open-source multimodal model for highly efficient long-video input processing

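Developed by OpenGVLab

VideoChat-Flash-7B is a multimodal model built on UMT-L (300M) and Qwen2-7B. It uses only 16 tokens per frame and supports input sequences of up to about 10,000 frames.

Video-to-text

Transformers

Language: English | License: Apache-2.0 | Tags: #long-video-understanding #low-token-multimodal #128k-context-window

Downloads: 661

Released: 1/11/2025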

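Model overview

This is a multimodal video-to-text model focused on tasks that combine video and text, with efficient video understanding and text generation.

Model features

Efficient video processing

Each frame uses only 16 tokens, which greatly reduces the processing cost per frame.

Long-sequence support

The context window is extended to 128k with Yarn, supporting input sequences of up to about 10,000 frames.

Multimodal capability

Combines video and text processing for complex multimodal tasks.

Model capabilities

Video understanding

Text generation

Multimodal interaction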

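Use cases

Video analysis

Video question answering

Answers questions about the content of a video.

Reaches 74.7% accuracy on the MLVU dataset.

Video summarization

Generates a text summary of a video's content.

Multimodal evaluation

Multimodal benchmarks

Evaluates multimodal performance on datasets such as MVBench.

Reaches 74.0% accuracy on MVBench.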

🚀 🦜VideoChat-Flash-Qwen2-7B_res448⚡

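VideoChat-Flash-Qwen2-7B_res448 is built on UMT-L (300M) and Qwen2-7B and uses only 16 tokens per frame. By using Yarn to extend the context window to 128k (Qwen2's native context window is 32k), the model supports input sequences of up to about 10,000 frames.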
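
As a rough sanity check on these figures, the arithmetic below relates the per-frame token cost to the context window. It assumes that 128k and 32k mean 131,072 and 32,768 tokens, and it ignores any further token compression inside the LLM, so it is a back-of-the-envelope estimate rather than the authors' own accounting. The helper name `max_frames` and the `reserved_text_tokens` parameter are hypothetical.

```python
TOKENS_PER_FRAME = 16            # from the model card
NATIVE_CONTEXT = 32 * 1024       # assumed meaning of "32k"
EXTENDED_CONTEXT = 128 * 1024    # assumed meaning of "128k"


def max_frames(context_tokens, tokens_per_frame=TOKENS_PER_FRAME, reserved_text_tokens=0):
    """Number of whole frames whose visual tokens fit in the context window."""
    usable = max(context_tokens - reserved_text_tokens, 0)
    return usable // tokens_per_frame


# Ratio between the extended and the native window.
context_scale = EXTENDED_CONTEXT / NATIVE_CONTEXT

if __name__ == "__main__":
    print("context scale:", context_scale)
    print("frames in native window:", max_frames(NATIVE_CONTEXT))
    print("frames in extended window:", max_frames(EXTENDED_CONTEXT))
```

Under these assumptions the extended window is four times the native one and holds the visual tokens of roughly 8,000 frames, which is the same order of magnitude as the "about 10,000 frames" quoted above.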

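⚠️ Important note

Because the training corpus is mainly English, the model has only basic Chinese understanding. For best performance, interact with it in English.

🚀 Quick start

Install dependencies

First, install the required modules. flash-attn (FlashAttention-2) is marked as optional in the commands below. A simple installation example: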

pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# Optional
pip install flash-attn --no-build-isolation
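
After installing, it can help to confirm which of these packages are actually importable before loading the model. The sketch below uses only the standard library; the import names (`cv2` for opencv-python, `flash_attn` for flash-attn) are the usual ones for these packages, and `check_packages` is a hypothetical helper, not part of the model's code.

```python
import importlib.util
from importlib import metadata

# Import name -> pip distribution name, taken from the install commands above.
REQUIRED = {
    "transformers": "transformers",
    "av": "av",
    "imageio": "imageio",
    "decord": "decord",
    "cv2": "opencv-python",
}
OPTIONAL = {"flash_attn": "flash-attn"}


def check_packages(packages):
    """Map each import name to its installed version, "unknown", or None if missing."""
    report = {}
    for import_name, dist_name in packages.items():
        if importlib.util.find_spec(import_name) is None:
            report[import_name] = None  # not importable
            continue
        try:
            report[import_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            report[import_name] = "unknown"  # importable, but no matching distribution
    return report


if __name__ == "__main__":
    for name, version in {**check_packages(REQUIRED), **check_packages(OPTIONAL)}.items():
        print(f"{name}: {version if version is not None else 'MISSING'}")
```

The install commands pin transformers to 4.40.1, so a different version in this report is worth checking first if loading fails.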

Using the model

from transformers import AutoModel, AutoTokenizer
import torch

# Model setup
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2-7B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # whether to use global compression
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# Evaluation settings
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# Single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# Multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)
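
The single-turn and multi-turn calls above follow one pattern: each call returns the updated history, which is passed into the next call. A minimal sketch of that loop is shown below. It uses only the `model.chat` keyword arguments that appear in the example above; `ask_questions` is a hypothetical helper, and omitting `chat_history` on the first turn mirrors the example rather than any documented default.

```python
def ask_questions(model, tokenizer, video_path, questions, max_num_frames, generation_config):
    """Ask several questions about one video, carrying the chat history forward."""
    answers = []
    chat_history = None
    for question in questions:
        kwargs = dict(
            video_path=video_path,
            tokenizer=tokenizer,
            user_prompt=question,
            return_history=True,
            max_num_frames=max_num_frames,
            generation_config=generation_config,
        )
        # The first turn passes no history, exactly as in the example above.
        if chat_history is not None:
            kwargs["chat_history"] = chat_history
        answer, chat_history = model.chat(**kwargs)
        answers.append(answer)
    return answers, chat_history
```

With the objects defined earlier, a call would look like `ask_questions(model, tokenizer, video_path, [question1, question2], max_num_frames, generation_config)`.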

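✨ Key features

Efficient token use: each frame uses only 16 tokens, which keeps long videos cheap to process.
Long-context support: Yarn extends the context window to 128k, supporting input sequences of up to about 10,000 frames.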
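
To connect the per-frame token cost to the `max_num_frames` setting in the quick start, the sketch below picks evenly spaced frame indices from a video and counts the resulting visual tokens. Uniform sampling is an assumption made for illustration; the frame selection that `model.chat` performs internally may differ. Both `sample_frame_indices` and `visual_token_count` are hypothetical helpers.

```python
TOKENS_PER_FRAME = 16  # from the model card


def sample_frame_indices(total_frames, max_num_frames):
    """Pick at most max_num_frames evenly spaced indices in [0, total_frames)."""
    if total_frames <= 0 or max_num_frames <= 0:
        return []
    n = min(total_frames, max_num_frames)
    # Take the midpoint of each of n equal segments, using integer arithmetic.
    return [((2 * i + 1) * total_frames) // (2 * n) for i in range(n)]


def visual_token_count(num_frames, tokens_per_frame=TOKENS_PER_FRAME):
    """Visual tokens produced by num_frames frames at a fixed per-frame cost."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    # A 30-minute video at 30 fps has 54,000 frames.
    indices = sample_frame_indices(54_000, 512)
    print(len(indices), "frames sampled,", visual_token_count(len(indices)), "visual tokens")
```

With `max_num_frames = 512`, as in the quick start, the sampled frames cost 8,192 visual tokens, which leaves most of a 128k window free.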

📈 Performance

Model	MVBench	LongVideoBench	VideoMME (w/o subtitles)	Max input frames
VideoChat-Flash-Qwen2_5-2B@448	70.0	58.3	57.0	10000
VideoChat-Flash-Qwen2-7B@224	73.2	64.2	64.0	10000
VideoChat-Flash-Qwen2_5-7B-1M@224	73.4	66.5	63.5	50000
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224	74.3	64.5	65.1	10000
VideoChat-Flash-Qwen2-7B@448	74.0	64.7	65.3	10000
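
The table is easier to compare as data. The snippet below copies the two Qwen2-7B rows verbatim and computes the per-benchmark difference between them. It assumes the `@224` and `@448` suffixes denote input resolution, as the `res448` model name suggests; `score_gain` is a hypothetical helper.

```python
# Scores copied from the two Qwen2-7B rows of the table above.
scores = {
    "VideoChat-Flash-Qwen2-7B@224": {"MVBench": 73.2, "LongVideoBench": 64.2, "VideoMME": 64.0},
    "VideoChat-Flash-Qwen2-7B@448": {"MVBench": 74.0, "LongVideoBench": 64.7, "VideoMME": 65.3},
}


def score_gain(base, other):
    """Per-benchmark difference (other - base), rounded to one decimal place."""
    return {name: round(other[name] - base[name], 1) for name in base}


gains = score_gain(
    scores["VideoChat-Flash-Qwen2-7B@224"],
    scores["VideoChat-Flash-Qwen2-7B@448"],
)

if __name__ == "__main__":
    for benchmark, delta in gains.items():
        print(f"{benchmark}: {delta:+.1f}")
```

For these two rows the 448 variant scores higher on all three benchmarks, by 0.8, 0.5, and 1.3 points respectively.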

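📚 Detailed documentation

Model metrics

Property	Details
Model type	Multimodal
Evaluation metric	Accuracy

Dataset evaluation results

Task type	Dataset	Accuracy (%)
Multimodal	MLVU	74.7
Multimodal	MVBench	74.0
Multimodal	Perception Test	76.2
Multimodal	LongVideoBench	64.7
Multimodal	VideoMME (w/o subtitles)	65.3
Multimodal	LVBench	48.2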

✏️ Citation

@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}