VideoChat-Flash-Qwen2_5-7B-1M_res224 Open-Source Model - A Multimodal Model for Long-Video Understanding

Developed by OpenGVLab

VideoChat-Flash is a multimodal model built on UMT-L and Qwen2.5-7B-1M. It supports long-video understanding, and its context window can be extended to 1M tokens.

Video-to-Text

Transformers

Language: English

License: Apache-2.0

Tags: #ultra-long video understanding #low-token multimodal #1M context window

Downloads: 64

Release date: 2/19/2025

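Model overview

This model focuses on multimodal interaction between video and text. It can process long video inputs of up to approximately 50,000 frames and is suited to video understanding and analysis tasks.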

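Model features

Efficient long-video processing

Yarn extends the context window to 1M, allowing long video inputs of up to approximately 50,000 frames.

Low token consumption

Each frame uses only 16 tokens, which keeps video understanding efficient.

Multimodal capability

Combines visual and language understanding to support interaction between video and text.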

Model capabilities

Video content understanding

Multimodal interaction

Long-video processing

Text generation

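Use cases

Video analysis

Video question answering

Answers questions based on video content

Achieves 74.1% accuracy on the MLVU dataset

Video content understanding

Understands and describes long video content

Achieves 66.5% accuracy on LongVideoBench

Multimodal testing

Perception test

Evaluates multimodal perception ability

Achieves 75.4% accuracy on Perception Test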

🚀 🦜VideoChat-Flash-Qwen2_5-7B-1M_res224⚡

VideoChat-Flash-Qwen2_5-7B-1M_res224 is built on UMT-L (300M) and Qwen2.5-7B-1M and uses only 16 tokens per frame. By using Yarn to extend the context window to 1M (the native context window of Qwen2.5-7B-1M is 128k), the model supports input sequences of up to approximately 50,000 frames.
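The frame limit can be checked with simple arithmetic. The sketch below assumes that "1M" means 1,000,000 tokens and that every frame costs exactly 16 tokens. The function names and the `text_reserve` parameter are illustrative and do not come from the model's code.

```python
TOKENS_PER_FRAME = 16        # stated in the model description
CONTEXT_WINDOW = 1_000_000   # assumption: "1M" taken as 1,000,000 tokens


def visual_tokens(num_frames: int) -> int:
    """Tokens consumed by the video part of the prompt."""
    return num_frames * TOKENS_PER_FRAME


def max_frames(context_window: int = CONTEXT_WINDOW, text_reserve: int = 0) -> int:
    """Frames that fit after setting aside text_reserve tokens for text.

    text_reserve is a hypothetical knob for the prompt and the generated answer.
    """
    if text_reserve > context_window:
        raise ValueError("text_reserve exceeds the context window")
    return (context_window - text_reserve) // TOKENS_PER_FRAME


print(visual_tokens(50_000))             # 800000
print(max_frames())                      # 62500
print(max_frames(text_reserve=200_000))  # 50000
```

Under these assumptions, 50,000 frames occupy 800,000 tokens, which leaves roughly 200,000 tokens of the window for the text prompt and the generated answer.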

[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

⚠️ Important note

Because the training corpus is mainly English, this model has only basic Chinese comprehension. For best results, interact with it in English.

📦 Installation

First, install flash attention2 and a few other modules. A simple installation example follows.

pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# Optional
pip install flash-attn --no-build-isolation
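After installing, a quick check that the modules are importable can save a failed model load. The following is a minimal sketch using only the standard library. The `REQUIRED` and `OPTIONAL` lists and the `missing_modules` helper are illustrative names, and the split into required and optional simply mirrors the comment in the install commands above.

```python
import importlib.util

# Import names for the packages installed above.
# opencv-python is imported as cv2, and flash-attn as flash_attn.
REQUIRED = ["transformers", "av", "imageio", "decord", "cv2"]
OPTIONAL = ["flash_attn"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing required:", missing_modules(REQUIRED))
print("missing optional:", missing_modules(OPTIONAL))
```

An empty list for the required modules means the basic usage example should at least import cleanly. The check does not verify package versions.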

💻 Usage example

Basic usage

from transformers import AutoModel, AutoTokenizer
import torch

# Model settings
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # whether to use global compression
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# Evaluation settings
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# Single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# Multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)

📈 Performance

Model	MVBench	LongVideoBench	VideoMME (w/o sub)	Max input frames
VideoChat-Flash-Qwen2_5-2B@448	70.0	58.3	57.0	10000
VideoChat-Flash-Qwen2-7B@224	73.2	64.2	64.0	10000
VideoChat-Flash-Qwen2_5-7B-1M@224	73.4	66.5	63.5	50000
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224	74.3	64.5	65.1	10000
VideoChat-Flash-Qwen2-7B@448	74.0	64.7	65.3	10000

✏️ Citation

@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}