SlowFast-Video-MLLM-Qwen2 Open-Source Video Multimodal Model - Balancing Space and Time for 64-Frame Video Understanding

slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4

Developed by shi-labs

A video multimodal large language model using a slow-fast architecture, balancing temporal resolution and spatial details, supporting 64-frame video understanding

Video-to-Text

Transformers

#Video Understanding #Multimodal LLM #Spatial-Temporal Dual Token

Release Time: 3/19/2025

Model Overview

This model applies a slow-fast dual-token strategy to video input, combining the Qwen2-7B language model with a ConvNeXt-576 visual encoder for efficient video understanding within a limited computational budget

Model Features

Slow-Fast Dual-Token Strategy

Fast tokens give the language model a compact overview of the video, while slow tokens keep uncompressed visual features from which instruction-relevant details are extracted

High Frame Count Processing

Accepts 64 input frames per video, giving finer temporal resolution than approaches limited to fewer frames

Linear Complexity Cross-Attention

Custom hybrid decoding layers enable linear-complexity cross-attention between text and raw video features

Model Capabilities

Video content understanding

Video content description generation

Multimodal reasoning

Long video processing

Use Cases

Video Content Analysis

Video Content Description

Generate detailed descriptions of input videos

Outperforms pure self-attention baselines in video understanding benchmarks

Intelligent Surveillance

Surveillance Video Analysis

Analyze key events in surveillance videos

🚀 Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 Frames)

This repository presents the Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4) model, introduced in the paper Slow-Fast Architecture for Video Multi-Modal Large Language Models. It aims to tackle the challenge of balancing temporal resolution and spatial detail in video-based multi-modal large language models under limited compute resources.

Code Repository | HuggingFace Collection

✨ Features

Novel Slow-Fast Design: Inspired by human video-watching behavior, it uses a dual-token strategy. "Fast" visual tokens offer a quick overview, while "Slow" visual tokens enable instruction-aware extraction of details.
Improved Performance: Allows processing of more input frames (e.g., 64 frames) while preserving spatial details, leading to better performance on video understanding benchmarks compared to self-attention-only baselines.
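
To make the dual-token idea concrete, here is a minimal NumPy sketch. The feature-map shape, the pooling factors, and the function name `split_slow_fast` are illustrative assumptions rather than the repository's implementation. The sketch only shows how a compact "fast" token set can be derived from the same per-frame features that are kept uncompressed as "slow" tokens.

```python
import numpy as np

def split_slow_fast(features, t_stride=4, s_pool=2):
    """Derive slow and fast token sets from per-frame features.

    features: array of shape (T, H, W, C). Shape and pooling factors are
    hypothetical; the real model may compress features differently.
    """
    T, H, W, C = features.shape
    # Slow tokens: keep every spatial position of every frame (no compression).
    slow = features.reshape(T * H * W, C)
    # Fast tokens: average over temporal windows and spatial blocks.
    Tt, Hh, Ww = T // t_stride, H // s_pool, W // s_pool
    cropped = features[:Tt * t_stride, :Hh * s_pool, :Ww * s_pool]
    pooled = cropped.reshape(Tt, t_stride, Hh, s_pool, Ww, s_pool, C).mean(axis=(1, 3, 5))
    fast = pooled.reshape(Tt * Hh * Ww, C)
    return slow, fast

# Toy input: 64 frames, an 8x8 feature grid, 16 channels.
feats = np.random.default_rng(0).standard_normal((64, 8, 8, 16)).astype(np.float32)
slow_tokens, fast_tokens = split_slow_fast(feats)
print("slow tokens:", slow_tokens.shape)
print("fast tokens:", fast_tokens.shape)
```

With these toy settings the fast set is 16 times smaller than the slow set, which is the kind of gap that makes it cheap to feed fast tokens directly into the language model while the slow tokens are consulted separately.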

📦 Installation

Note: This model relies on custom model code (LlavaQwenSlowFastForCausalLM) built on top of the transformers library. Either install the packages from the official repository or pass trust_remote_code=True when loading the model.

First, clone the repository and install requirements if running locally:

git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
pip install --upgrade pip
pip install -r requirements.txt
# Add the cloned repo path to your PYTHONPATH or install it
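
If you prefer not to set PYTHONPATH in the shell, the same effect can be had from inside a script. The sketch below assumes the repository was cloned into the current working directory under its default name; `ensure_on_path` is a hypothetical helper, not something the repository provides.

```python
import importlib.util
import sys
from pathlib import Path

def ensure_on_path(repo_dir):
    """Prepend repo_dir to sys.path unless it is already present."""
    repo_dir = str(Path(repo_dir).expanduser().resolve())
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    return repo_dir

# Assumption: the clone lives in ./Slow-Fast-Video-Multimodal-LLM
repo_path = ensure_on_path("Slow-Fast-Video-Multimodal-LLM")

# find_spec returns None when the package cannot be located.
llava_found = importlib.util.find_spec("llava") is not None
print("llava importable:", llava_found)
```

Run this before the `from llava...` imports in the usage example. If it reports False, check the clone location.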

💻 Usage Examples

Basic Usage

import torch
import os
import numpy as np
from decord import VideoReader, cpu
import requests # Required to download video

# Make sure the necessary llava modules are importable
# If not installed from the repo, trust_remote_code=True handles this
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Load exactly max_frames_num frames as a (T, H, W, C) NumPy array."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=4)
    total_frames = len(vr)

    if total_frames >= max_frames_num:
        # Uniformly sample frames across the whole video
        frame_idx = np.linspace(0, total_frames - 1, max_frames_num, dtype=int).tolist()
    else:
        # Video is shorter than max_frames_num: take every frame, then repeat the last one
        frame_idx = list(range(total_frames))
        frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))

    try:
        frames = vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Report the failure, then re-raise so the caller never gets a partial result
        print(f"Error loading video frames: {e}")
        raise

    return frames

# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64 # This checkpoint was trained with 64 frames

# Download the video if it doesn't exist
if not os.path.exists(video_local_path):
    print(f"Downloading video from {video_url}...")
    response = requests.get(video_url, stream=True)
    response.raise_for_status() # Raise an exception for bad status codes
    with open(video_local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("Download complete.")


# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)

# Use trust_remote_code=True to load the custom architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    use_flash_attn=True,      # Use Flash Attention if available
    device_map="auto",        # Automatically distribute model across GPUs/CPU
    torch_dtype=torch.bfloat16, # Use bfloat16 for efficiency
    trust_remote_code=True
)

# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + question

conv = conv_templates["qwen_1_5"].copy() # Use the appropriate conversation template
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()

# Load and process video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")

# Preprocess video frames
print("Preprocessing video...")
# Ensure video has shape (T, H, W, C) before preprocessing
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor] # The model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")


# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add batch dimension if necessary (tokenizer_image_token might already return batched)
if input_ids.ndim == 1:
    input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")


# Generate response
print("Generating response...")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos, # Pass the processed video tensor list
        do_sample=True,
        temperature=0.2,
        top_p=1.0,
        num_beams=1,
        max_new_tokens=1024,
        use_cache=True
    )

# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"\nUser input: {question}\n")
print(f"Model output:\n{outputs}")
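
The frame-selection rule inside `load_video` can be checked without decord or a video file. The sketch below pulls that rule out into a standalone function (`sample_frame_indices` is a name introduced here for illustration) and runs it on a long clip and on a clip shorter than the frame budget.

```python
import numpy as np

def sample_frame_indices(total_frames, max_frames_num):
    """Return max_frames_num frame indices using the same rule as load_video."""
    if total_frames <= 0:
        raise ValueError("video contains no frames")
    if total_frames >= max_frames_num:
        # Uniform sampling that always includes the first and last frame.
        return np.linspace(0, total_frames - 1, max_frames_num, dtype=int).tolist()
    # Short video: keep every frame and pad by repeating the last index.
    idx = list(range(total_frames))
    idx.extend([total_frames - 1] * (max_frames_num - total_frames))
    return idx

long_clip = sample_frame_indices(640, 64)   # more frames than the budget
short_clip = sample_frame_indices(10, 64)   # fewer frames than the budget
print(len(long_clip), long_clip[0], long_clip[-1])
print(len(short_clip), short_clip[-3:])
```

Either way the model receives exactly 64 frames. Padding a short clip by repeating its last frame keeps the tensor shape fixed, at the cost of feeding duplicate content.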

📄 License

The model weights are released under the CC-BY-NC-4.0 license. The code is released under the Apache 2.0 license. Users must comply with all terms and conditions of the original licenses, including the specific licenses for the base language model (Qwen2 License).

📚 Documentation

Model Description

This model introduces a novel slow-fast architecture to address the challenge of balancing temporal resolution and spatial detail in video-based multi-modal large language models (MLLMs) under limited compute budgets. Existing methods often compress video representations irreversibly, losing detail.

Inspired by how humans first skim a video before focusing on relevant parts, the slow-fast design employs a dual-token strategy:

"Fast" visual tokens: A compact set of compressed video features fed into the LLM (Qwen2-7B-Instruct) alongside text embeddings for a quick overview.
"Slow" visual tokens: Uncompressed video features cross-attended by text embeddings via specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity.

This approach allows processing more input frames (e.g., 64 frames for this checkpoint) while preserving spatial details, leading to significant performance improvements on video understanding benchmarks compared to self-attention-only baselines. This checkpoint uses a Qwen2-7B-Instruct base LLM and a ConvNeXt-576 vision tower.
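
The linear-complexity claim can be illustrated with a small NumPy sketch of single-head cross-attention, in which text embeddings act as queries over video tokens. It is a simplified sketch under stated assumptions: learned projections, multiple heads, and any other components of the hybrid decoder layers are left out, and all sizes are made up. The helper `score_entries` counts how many attention scores each scheme computes.

```python
import numpy as np

def cross_attention(text_q, video_k, video_v):
    """Single-head cross-attention: each text query attends over all video tokens."""
    d = text_q.shape[-1]
    scores = text_q @ video_k.T / np.sqrt(d)        # (n_text, n_video)
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over video tokens
    return weights @ video_v                         # (n_text, d)

def score_entries(n_text, n_video):
    """Count attention scores for joint self-attention vs. text-to-video cross-attention."""
    self_attn = (n_text + n_video) ** 2   # every token attends to every token
    cross_attn = n_text * n_video         # only text queries attend to video tokens
    return self_attn, cross_attn

rng = np.random.default_rng(0)
n_text, n_video, d = 32, 4096, 64         # illustrative sizes only
out = cross_attention(rng.standard_normal((n_text, d)),
                      rng.standard_normal((n_video, d)),
                      rng.standard_normal((n_video, d)))
print("output shape:", out.shape)
print("scores (self, cross) at n_video:  ", score_entries(n_text, n_video))
print("scores (self, cross) at 2*n_video:", score_entries(n_text, 2 * n_video))
```

Doubling the number of video tokens doubles the cross-attention score count for a fixed text length, whereas the joint self-attention count grows roughly fourfold when video tokens dominate the sequence.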

📖 Citation

If you find this work useful, please consider citing the paper:

@misc{shi2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Min Shi and Shihao Wang and Chieh-Yun Chen and Jitesh Jain and Kai Wang and Junjun Xiong and Guilin Liu and Zhiding Yu and Humphrey Shi},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

(Note: Author list based on potential updates to the arXiv paper; please verify with the final published version if available.)
