slowfast-video-mllm-qwen2: an open-source video multimodal model that balances temporal and spatial resolution and supports 64-frame video understanding

Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4

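Developed by shi-labs

A video multimodal large language model built on a slow-fast architecture that balances temporal resolution against spatial detail and supports 64-frame video understanding.

Video-to-Text

Transformers

#Video understanding #Multimodal LLM #Spatiotemporal dual tokens

Downloads: 184

Released: 3/19/2025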

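Model overview

This model processes video input with a slow-fast dual-token strategy. It combines the Qwen2-7B language model with a ConvNeXt-576 vision encoder to achieve efficient video understanding within a limited compute budget.

Model features

Slow-fast dual-token strategy

Fast tokens give a quick overview of the video content, while slow tokens allow fine visual details to be extracted.

64-frame input

Supports 64-frame video input, giving higher temporal resolution than approaches limited to fewer frames.

Linear-complexity cross-attention

Specially designed hybrid decoder layers let the text cross-attend to the uncompressed video features with linear complexity.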
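
To make the linear-complexity claim concrete, the sketch below counts query-key pairs per attention layer for two setups: one where every video token joins self-attention, and one where only compressed fast tokens do while the text cross-attends to the uncompressed slow tokens. All sizes are hypothetical values chosen for illustration, not figures from this model, and the count ignores attention heads and feature dimensions.

```python
def attention_pairs_self_only(num_text: int, num_video: int) -> int:
    """Query-key pairs when text and all video tokens share one self-attention."""
    total = num_text + num_video
    return total * total


def attention_pairs_slow_fast(num_text: int, num_fast: int, num_slow: int) -> int:
    """Pairs for self-attention over text + fast tokens, plus text-to-slow cross-attention."""
    self_attn = (num_text + num_fast) ** 2
    cross_attn = num_text * num_slow  # grows linearly with the number of slow tokens
    return self_attn + cross_attn


# Hypothetical sizes, assumed only for this illustration.
FRAMES = 64
TOKENS_PER_FRAME = 100       # assumed uncompressed (slow) tokens per frame
FAST_TOKENS_PER_FRAME = 4    # assumed compressed (fast) tokens per frame
TEXT_TOKENS = 50

num_slow = FRAMES * TOKENS_PER_FRAME
num_fast = FRAMES * FAST_TOKENS_PER_FRAME

baseline = attention_pairs_self_only(TEXT_TOKENS, num_slow)
slow_fast = attention_pairs_slow_fast(TEXT_TOKENS, num_fast, num_slow)
print(f"self-attention only: {baseline:,} pairs")   # 41,602,500
print(f"slow-fast:           {slow_fast:,} pairs")  # 413,636
```

Under these assumed sizes, doubling the number of slow tokens adds only `TEXT_TOKENS * num_slow` pairs to the slow-fast count, whereas the self-attention-only count grows roughly fourfold.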

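Model capabilities

Video content understanding

Video description generation

Multimodal reasoning

Long video processing

Use cases

Video content analysis

Video description

Generates a detailed description of the input video.

Outperforms self-attention-only baselines on video understanding benchmarks.

Intelligent surveillance

Surveillance footage analysis

Analyzes important events in surveillance footage.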

🚀 Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 frames)

This repository contains the Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4) model introduced in the paper Slow-Fast Architecture for Video Multi-Modal Large Language Models.

[Code repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) | [HuggingFace collection](https://huggingface.co/collections/shi-labs/slow-fast-video-mllm-67ef347a28772734c15a78b5)

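🚀 Quick start

This model introduces a new Slow-Fast architecture to address the challenge of balancing temporal resolution and spatial detail under limited compute in video-based multimodal large language models (MLLMs).

✨ Key features

Inspired by how humans first skim a video and then focus on the relevant parts, the Slow-Fast design uses a dual-token strategy.

"Fast" visual tokens: a compact set of compressed video features that is fed into the LLM (Qwen2-7B-Instruct) together with the text embeddings to provide a quick overview.
"Slow" visual tokens: uncompressed video features that the text embeddings cross-attend to through specially designed hybrid decoder layers, so that instruction-relevant visual details can be extracted with linear complexity.

This approach makes it possible to process more input frames (for example, 64 in this checkpoint) while preserving spatial detail, and it yields substantial gains on video understanding benchmarks over self-attention-only baselines. This checkpoint uses a Qwen2-7B-Instruct base LLM and a ConvNeXt-576 vision tower.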
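
As a rough illustration of the dual-token idea, the sketch below derives both token sets from the same per-frame features: slow tokens keep every feature vector, while fast tokens are a compressed summary. The compression actually used by the model is not described in this card, so mean pooling is only a stand-in here, and `build_slow_fast_tokens`, `mean_pool`, and the `group` parameter are hypothetical names.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_slow_fast_tokens(
    frames: List[List[Vector]], group: int
) -> Tuple[List[Vector], List[Vector]]:
    """Split per-frame features into slow (all) and fast (pooled) token lists.

    frames: one list of feature vectors per frame.
    group:  how many consecutive vectors are pooled into one fast token
            (an assumed compression factor, not a model setting).
    """
    slow: List[Vector] = []
    fast: List[Vector] = []
    for frame in frames:
        slow.extend(frame)  # slow tokens keep every feature vector
        for start in range(0, len(frame), group):
            fast.append(mean_pool(frame[start:start + group]))
    return slow, fast


# Toy input: 2 frames, each with 4 two-dimensional feature vectors.
toy_frames = [
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]],
    [[2.0, 2.0], [4.0, 4.0], [1.0, 1.0], [1.0, 3.0]],
]
slow_tokens, fast_tokens = build_slow_fast_tokens(toy_frames, group=2)
print(len(slow_tokens), len(fast_tokens))  # 8 4
print(fast_tokens[0])                      # [2.0, 0.0]
```

In the design described above, the short fast list would accompany the text through the LLM's self-attention, while the full slow list would be reached only through cross-attention.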

💻 Usage examples

Basic usage

import torch
import os
import numpy as np
from decord import VideoReader, cpu
import requests # Required to download video

# These imports require the llava package from the code repository to be importable.
# trust_remote_code=True (used below) does not install it.
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Load exactly max_frames_num frames as a (T, H, W, C) array."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=4)
    total_frames = len(vr)

    if total_frames >= max_frames_num:
        # Uniformly sample frame indices across the whole video
        frame_idx = np.linspace(0, total_frames - 1, max_frames_num, dtype=int).tolist()
    else:
        # Video is shorter than requested: take every frame, then repeat the last one
        frame_idx = list(range(total_frames))
        frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))

    try:
        spare_frames = vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Report the failure, then re-raise with the original traceback
        print(f"Error loading video frames: {e}")
        raise

    return spare_frames

# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64 # This checkpoint was trained with 64 frames

# Download the video if it doesn't exist
if not os.path.exists(video_local_path):
    print(f"Downloading video from {video_url}...")
    response = requests.get(video_url, stream=True)
    response.raise_for_status() # Raise an exception for bad status codes
    with open(video_local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("Download complete.")


# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)

# Use trust_remote_code=True to load the custom architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    use_flash_attn=True,      # Use Flash Attention if available
    device_map="auto",        # Automatically distribute model across GPUs/CPU
    torch_dtype=torch.bfloat16, # Use bfloat16 for efficiency
    trust_remote_code=True
)

# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + question

conv = conv_templates["qwen_1_5"].copy() # Use the appropriate conversation template
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()

# Load and process video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")

# Preprocess video frames
print("Preprocessing video...")
# Ensure video has shape (T, H, W, C) before preprocessing
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor] # The model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")


# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add batch dimension if necessary (tokenizer_image_token might already return batched)
if input_ids.ndim == 1:
    input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")


# Generate response
print("Generating response...")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos, # Pass the processed video tensor list
        do_sample=True,
        temperature=0.2,
        top_p=1.0,
        num_beams=1,
        max_new_tokens=1024,
        use_cache=True
    )

# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"\nUser input: {question}\n")
print(f"Model output:\n{outputs}")
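
The frame-sampling rule inside load_video can be checked without decord or a video file. The sketch below re-implements it in plain Python; `sample_frame_indices` is a hypothetical helper, and integer floor division stands in for `np.linspace(..., dtype=int)`, so an index could differ by one in rare floating-point edge cases.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Return num_frames indices into a video that has total_frames frames.

    Long videos are sampled uniformly from first to last frame; short videos
    use every frame and then repeat the last frame as padding.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")

    if total_frames >= num_frames:
        if num_frames == 1:
            return [0]
        last = total_frames - 1
        # Evenly spaced positions between 0 and last, rounded down
        return [i * last // (num_frames - 1) for i in range(num_frames)]

    indices = list(range(total_frames))
    indices.extend([total_frames - 1] * (num_frames - total_frames))
    return indices


print(sample_frame_indices(10, 4))  # [0, 3, 6, 9]
print(sample_frame_indices(3, 5))   # [0, 1, 2, 2, 2]
```

Either way the result has exactly `num_frames` entries, which matches the fixed frame count (64 for this checkpoint) that the usage example passes to the model.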

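Advanced usage

The basic usage code can be adapted to other videos and questions by changing a few parameters. For example, change question to ask something else, or max_frames to sample a different number of frames (note that this checkpoint was trained with 64 frames).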
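
One way to vary the question is to wrap the prompt-building step of the basic example in a small function. The sketch below is an illustration only: `build_video_prompt` is a hypothetical helper, and the three placeholder strings are assumed to match the values in `llava.constants`, so check them against the code repository before relying on them.

```python
# Placeholder strings assumed to match llava.constants; verify against the repo.
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"


def build_video_prompt(question: str, use_im_start_end: bool = False) -> str:
    """Prefix a question with the visual placeholder, as in the basic example."""
    if use_im_start_end:
        visual = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    else:
        visual = DEFAULT_IMAGE_TOKEN
    # Stripping surrounding whitespace is an extra convenience, not part of the original code
    return visual + "\n" + question.strip()


questions = [
    "Please describe this video in detail.",
    "What happens at the end of the video?",
]
for q in questions:
    print(repr(build_video_prompt(q)))
```

Each resulting string would then go through the conversation template and `tokenizer_image_token` exactly as in the basic example, with `model.config.mm_use_im_start_end` supplying the flag.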

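📄 License

The model weights are released under the CC-BY-NC-4.0 license, and the code is released under the Apache 2.0 license. Users must comply with all terms of the original licenses, including the specific license of the base language model ([Qwen2 license](https://huggingface.co/Qwen/Qwen2-7B-Instruct/blob/main/LICENSE)).

Citation

If you find this work useful, please cite the following paper.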

@misc{zhou2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Yifei Zhou and Jiaming Zuo and Chen Change Loy and Chongyang Zhong and Xin Wang and Qi Wu and Weidong Cai and Xiaodong He and Qingzhong Wang and Lei Zhang and Marcelo H. Ang Jr and Boyang Li and Yanfeng Wang and Qinghai He and Fengbei Liu and Liangchen Luo and Jingdong Wang and Conghui He and Wenhai Wang},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

(Note: the author list may reflect updates to the arXiv paper. If possible, verify it against the final published version.)