slowfast-video-mllm-qwen2: an open-source video multimodal model that balances temporal and spatial detail and supports 64-frame video understanding

Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4

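Developed by shi-labs

A video multimodal large language model with a slow-fast architecture that balances temporal resolution against spatial detail and supports 64-frame video understanding.

Video-to-Text

Transformers

#VideoUnderstanding #MultimodalLLM #SpatiotemporalDualToken

Downloads: 184

Released: 3/19/2025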

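Model Overview

The model processes video input with a slow-fast dual-token strategy, combining the Qwen2-7B language model with the ConvNeXt-576 vision encoder to achieve efficient video understanding under a limited compute budget.

Model Features

Slow-fast dual-token strategy

Fast tokens give a quick overview of the video content, while slow tokens extract fine visual detail, which keeps video understanding efficient.

High-frame-count processing

Accepts 64-frame video input, a markedly higher temporal resolution than conventional methods offer.

Linear-complexity cross-attention

A purpose-built hybrid decoder layer lets the text attend to the raw video features through cross-attention with linear complexity.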
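To make the "linear complexity" claim concrete, here is a minimal counting sketch. It assumes the dominant cost is the number of query-key score pairs, and the token counts used are hypothetical values chosen for illustration, not figures from the model. With a fixed text length, cross-attention from text to visual tokens grows linearly in the number of visual tokens, while self-attention over the concatenated sequence grows quadratically.

```python
def attention_score_counts(n_text, n_visual):
    """Return (self_attention_pairs, cross_attention_pairs).

    Self-attention: every token in the concatenated sequence attends to every token.
    Cross-attention: only text tokens act as queries over the visual tokens.
    """
    self_attn = (n_text + n_visual) ** 2
    cross_attn = n_text * n_visual
    return self_attn, cross_attn


# Hypothetical sizes: 100 text tokens, visual token count doubling each step.
for n_visual in (1_000, 2_000, 4_000):
    self_attn, cross_attn = attention_score_counts(n_text=100, n_visual=n_visual)
    print(f"visual={n_visual}: self-attention pairs={self_attn}, cross-attention pairs={cross_attn}")
```

Doubling the visual token count doubles the cross-attention pair count, whereas the self-attention pair count grows by close to a factor of four.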

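Model Capabilities

Video content understanding

Video description generation

Multimodal reasoning

Long-video processing

Use Cases

Video content analysis

Video description

Generates a detailed description of the input video

Outperforms a self-attention-only baseline on video understanding benchmarks

Intelligent surveillance

Surveillance video analysis

Identifies key events in surveillance footage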

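🚀 Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 frames)

This repository contains the Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4) model, which was introduced in the paper Slow-Fast Architecture for Video Multi-Modal Large Language Models.

Code repository | HuggingFace collection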

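✨ Key Features

This model introduces a novel slow-fast architecture that addresses a core challenge for video-based multimodal large language models (MLLMs): balancing temporal resolution against spatial detail under a limited compute budget. Existing methods typically compress the video representation irreversibly, which discards detail.

Inspired by the way people skim a video first and then focus on the relevant parts, the slow-fast design uses a dual-token strategy:

"Fast" visual tokens: a compact set of compressed video features that is fed into the LLM (Qwen2-7B-Instruct) together with the text embeddings to give a quick overview of the video.
"Slow" visual tokens: uncompressed video features that the text embeddings cross-attend to through specially designed hybrid decoder layers, which allows instruction-relevant visual details to be extracted with linear complexity.

This approach makes it possible to process more input frames (this checkpoint handles 64) while preserving spatial detail, and it yields significant gains on video understanding benchmarks over a self-attention-only baseline. This checkpoint uses Qwen2-7B-Instruct as the base LLM and ConvNeXt-576 as the vision tower.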
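The two token paths described above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the repository's implementation: mean pooling stands in for whatever compression the model applies, the cross-attention has no learned projections, and `TOKENS_PER_FRAME`, `DIM`, and the text length are hypothetical values.

```python
import math
import random


def mean_pool(tokens, pool):
    """Average every `pool` consecutive token vectors into one (a stand-in for compression)."""
    assert len(tokens) % pool == 0, "token count must be divisible by pool size"
    pooled = []
    for start in range(0, len(tokens), pool):
        group = tokens[start:start + pool]
        pooled.append([sum(column) / pool for column in zip(*group)])
    return pooled


def cross_attend(queries, keys_values):
    """Each query attends over all key/value vectors; the result is added back (residual)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim) for kv in keys_values]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        attended = [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        outputs.append([qi + ai for qi, ai in zip(q, attended)])
    return outputs


random.seed(0)
DIM = 8                 # hypothetical feature size
FRAMES = 64             # frame count of this checkpoint
TOKENS_PER_FRAME = 4    # hypothetical; the real value depends on the vision tower

slow_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(FRAMES * TOKENS_PER_FRAME)]
text_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

# Fast path: compressed tokens are concatenated with the text for self-attention in the LLM.
fast_tokens = mean_pool(slow_tokens, pool=TOKENS_PER_FRAME)
llm_input = fast_tokens + text_tokens

# Slow path: text queries read the uncompressed tokens through cross-attention.
text_with_details = cross_attend(text_tokens, slow_tokens)

print(len(slow_tokens), len(fast_tokens), len(llm_input), len(text_with_details))
# prints: 256 64 69 5
```

The self-attention sequence stays short (69 tokens here) even though all 256 uncompressed tokens remain available to the text through the slow path.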

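📦 Installation

Note: This model relies on custom code (LlavaQwenSlowFastForCausalLM) integrated with the transformers library. Make sure you have installed the required packages from the official repository, or pass trust_remote_code=True when loading the model.

To run locally, first clone the repository and install its dependencies: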

git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
pip install --upgrade pip
pip install -r requirements.txt
# Add the cloned repository path to your PYTHONPATH, or install it as a package
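If you prefer not to change PYTHONPATH globally, one option is to extend `sys.path` at the top of your script. The sketch below uses only the standard library. The clone location `Slow-Fast-Video-Multimodal-LLM` is assumed to be relative to the current working directory, and the helper names `add_repo_to_path` and `is_importable` are hypothetical, not part of the repository.

```python
import importlib.util
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Prepend `repo_dir` to sys.path (only once) and return the resolved path."""
    resolved = str(Path(repo_dir).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed clone location; adjust to wherever you cloned the repository.
repo_path = add_repo_to_path("Slow-Fast-Video-Multimodal-LLM")
print("llava importable:", is_importable("llava"))
```

If the check prints False, the `from llava ...` imports in the usage example will fail, so fix the path before loading the model.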

💻 Usage Examples

Basic Usage

import torch
import os
import numpy as np
from decord import VideoReader, cpu
import requests # Required to download video

# The llava modules below come from the cloned repository and must be importable
# (add the repository to PYTHONPATH or install it before running this script)
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Load exactly max_frames_num frames from a video as a (T, H, W, C) array."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=4)
    total_frames = len(vr)

    if total_frames >= max_frames_num:
        # Uniformly sample frame indices across the whole video
        frame_idx = np.linspace(0, total_frames - 1, max_frames_num, dtype=int).tolist()
    else:
        # Video is shorter than requested: take every frame, then repeat the last one
        frame_idx = list(range(total_frames))
        frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))

    try:
        return vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Report the failure, then re-raise so the caller never continues without frames
        print(f"Error loading video frames: {e}")
        raise

# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64 # This checkpoint was trained with 64 frames

# Download the video if it doesn't exist
if not os.path.exists(video_local_path):
    print(f"Downloading video from {video_url}...")
    response = requests.get(video_url, stream=True)
    response.raise_for_status() # Raise an exception for bad status codes
    with open(video_local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("Download complete.")


# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)

# Use trust_remote_code=True to load the custom architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    use_flash_attn=True,      # Use Flash Attention if available
    device_map="auto",        # Automatically distribute model across GPUs/CPU
    torch_dtype=torch.bfloat16, # Use bfloat16 for efficiency
    trust_remote_code=True
)

# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + " " + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + " " + question

conv = conv_templates["qwen_1_5"].copy() # Use the appropriate conversation template
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()

# Load and process video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")

# Preprocess video frames
print("Preprocessing video...")
# Ensure video has shape (T, H, W, C) before preprocessing
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor] # The model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")


# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add batch dimension if necessary (tokenizer_image_token might already return batched)
if input_ids.ndim == 1:
    input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")


# Generate response
print("Generating response...")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos, # Pass the processed video tensor list
        do_sample=True,
        temperature=0.2,
        top_p=1.0,
        num_beams=1,
        max_new_tokens=1024,
        use_cache=True
    )

# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"User input: {question}")
print(f"Model output: {outputs}")
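The frame-selection rule used by `load_video` can be checked without decord or a video file. The sketch below reimplements only the index computation in plain Python. The function name `sample_frame_indices` is hypothetical, and integer floor division is used in place of `np.linspace(..., dtype=int)`, so it follows the same idea but is not guaranteed to match NumPy's rounding in every case.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick `num_frames` indices from a video with `total_frames` frames.

    Long videos are sampled uniformly from first to last frame; short videos
    use every frame and pad by repeating the final frame.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if total_frames >= num_frames:
        if num_frames == 1:
            return [0]
        # Integer arithmetic avoids floating-point drift on the last index.
        return [(i * (total_frames - 1)) // (num_frames - 1) for i in range(num_frames)]
    return list(range(total_frames)) + [total_frames - 1] * (num_frames - total_frames)


print(sample_frame_indices(10, 4))  # [0, 3, 6, 9]
print(sample_frame_indices(3, 5))   # [0, 1, 2, 2, 2]
```

Either way the result always has exactly `num_frames` entries, which is what the usage example relies on when it requests 64 frames.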

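📄 License

The model weights are released under the CC-BY-NC-4.0 license, and the code under the Apache 2.0 license. Users must comply with all terms and conditions of the original licenses, including the license specific to the base language model (the Qwen2 license).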

📚 Documentation

Citation

If you find this work useful, please consider citing the paper:

@misc{zhou2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Yifei Zhou and Jiaming Zuo and Chen Change Loy and Chongyang Zhong and Xin Wang and Qi Wu and Weidong Cai and Xiaodong He and Qingzhong Wang and Lei Zhang and Marcelo H. Ang Jr and Boyang Li and Yanfeng Wang and Qinghai He and Fengbei Liu and Liangchen Luo and Jingdong Wang and Conghui He and Wenhai Wang},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

(Note: the author list may change as the arXiv paper is updated; if a final published version exists, treat it as authoritative.)