Slowfast-video-mllm Open-source Model - Breaking Traditional Limits to Achieve Spatiotemporal Detail Understanding of Videos

slowfast-video-mllm-qwen2-7b-convnext-576-frame96-s1t6

Developed by shi-labs

Adopts an innovative slow-fast architecture to balance temporal resolution and spatial details in video understanding, overcoming the sequence length limitations of traditional large language models.

Video-to-Text

Transformers

#Video Understanding #Slow-Fast Architecture #Multimodal LLM

Downloads 81

Release Time: 3/24/2025

Model Overview

This model employs a dual-token strategy: 'fast tokens' provide quick overviews, while 'slow tokens' enable instruction-aware detail extraction through cross-attention mechanisms, specifically designed for video-to-text conversion tasks.

Model Features

Dual-Token Strategy

Fast tokens provide quick overviews while slow tokens enable instruction-aware detail extraction, balancing temporal resolution and spatial details in video understanding.

Overcoming Sequence Length Limitations

Innovative architecture design overcomes the sequence length limitations of traditional large language models when processing long video sequences.
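
A back-of-the-envelope count shows why sequence length becomes the bottleneck for long videos. The sketch below is illustrative only: the per-frame token count and the pooling factor are hypothetical values chosen for the arithmetic, not figures taken from the paper or from this checkpoint's configuration.

```python
def visual_tokens(num_frames: int, tokens_per_frame: int, pool: int = 1) -> int:
    """Number of visual tokens placed in the LLM input sequence when each
    frame's tokens are reduced by a factor of `pool`."""
    return num_frames * (tokens_per_frame // pool)


# Hypothetical numbers, for illustration only
NUM_FRAMES = 96
TOKENS_PER_FRAME = 256

# Every visual token is fed to the LLM directly
uncompressed = visual_tokens(NUM_FRAMES, TOKENS_PER_FRAME)
# Only a heavily pooled overview is fed to the LLM
compressed = visual_tokens(NUM_FRAMES, TOKENS_PER_FRAME, pool=16)

print(uncompressed, compressed)
```

Under these assumed numbers, feeding every token directly costs 16 times more sequence positions than the pooled overview, which is the kind of gap a compressed "fast" path is meant to close.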

Multimodal Understanding

Capable of processing both video and text inputs simultaneously, enabling cross-modal understanding and generation.

Model Capabilities

Video content understanding

Video-to-text generation

Multimodal reasoning

Long video sequence processing

Use Cases

Video Content Analysis

Video Caption Generation

Automatically generates detailed, accurate text descriptions of input video content

Video Question Answering System

Understands video content and answers complex questions about it

Intelligent Surveillance

Surveillance Video Analysis

Automatically identifies and describes key events in surveillance videos

🚀 Slow-Fast Architecture for Video Multi-Modal Large Language Models

This repository presents a model that uses a slow-fast architecture to balance temporal resolution and spatial detail in video understanding, overcoming the sequence length limitations of traditional LLMs.

🚀 Quick Start

This repository contains the model presented in the paper "Slow-Fast Architecture for Video Multi-Modal Large Language Models" (arXiv:2504.01328).

Code: https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM

✨ Features

This model uses a novel slow-fast architecture to balance temporal resolution and spatial detail in video understanding, overcoming the sequence length limitations of traditional LLMs. It employs a dual-token strategy: "fast" tokens provide a quick overview, while "slow" tokens allow instruction-aware extraction of details via cross-attention.
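
The cross-attention step can be pictured with a minimal NumPy sketch. This is a generic single-head scaled dot-product cross-attention, not the repository's implementation: the function name, tensor sizes, and the lack of learned projections are all assumptions made for illustration.

```python
import numpy as np


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative sketch).

    queries:      (n_text, d)  e.g. hidden states of instruction/text tokens
    keys, values: (n_slow, d)  e.g. uncompressed "slow" visual tokens
    Returns the attended output (n_text, d) and the weights (n_text, n_slow).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over slow tokens
    return weights @ values, weights


# Hypothetical sizes: 8 text tokens attending to 1024 slow visual tokens
rng = np.random.default_rng(0)
text_states = rng.standard_normal((8, 64))
slow_tokens = rng.standard_normal((1024, 64))

attended, attn = cross_attention(text_states, slow_tokens, slow_tokens)
print(attended.shape, attn.shape)
```

In this sketch the slow tokens serve only as keys and values, so the output has one row per text token and the query sequence does not grow with the number of slow tokens.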

💻 Usage Examples

Basic Usage

import torch
import os
import numpy as np
from decord import VideoReader

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init

def load_video(video_path, max_frames_num):
    """Return max_frames_num uniformly sampled frames as a (T, H, W, C) array."""
    vr = VideoReader(video_path, num_threads=4)
    # Sample frame indices evenly across the whole clip
    frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()

# Model
# Ensure you have cloned the code repository: git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4" # Or another checkpoint
video_path = "Slow-Fast-Video-Multimodal-LLM/assets/catinterrupt.mp4" # Example video path from cloned repo
question = "Please describe this video in detail."
max_frames = 64 # Set according to the specific checkpoint (64 for the frame64 checkpoint above)

disable_torch_init()
model_path = os.path.expanduser(model_path)
model_name = get_model_name_from_path(model_path)
# Make sure to pass trust_remote_code=True
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, use_flash_attn=True, trust_remote_code=True)

if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + " \n" + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + " \n" + question

conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# read and process video
video = load_video(video_path, max_frames_num=max_frames)
video_tensor = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda()
videos = [video_tensor]

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device='cuda', non_blocking=True).unsqueeze(dim=0)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos,
        do_sample=True,
        max_new_tokens=1024,
        num_beams=1,
        temperature=0.2,
        top_p=1.0,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"User input: {question} \n")
print(outputs)
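
The uniform sampling done in load_video can be checked without decoding a video. The helper below is a hypothetical stand-alone sketch (uniform_frame_indices is not part of the repository) that applies the same idea as np.linspace(0, total - 1, n, dtype=int), using integer arithmetic so it needs no dependencies.

```python
def uniform_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Evenly spaced frame indices from 0 to total_frames - 1 (inclusive)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if max_frames == 1:
        return [0]
    last = total_frames - 1
    # Integer floor of i * last / (max_frames - 1) avoids float rounding issues
    return [i * last // (max_frames - 1) for i in range(max_frames)]


print(uniform_frame_indices(10, 4))  # → [0, 3, 6, 9]
print(uniform_frame_indices(3, 6))   # → [0, 0, 0, 1, 1, 2]
```

The second call shows that when a clip has fewer frames than max_frames, some frames are repeated, so the model still receives the requested number of frames.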

📄 License

This project is licensed under the CC BY-NC 4.0 license.

📚 Citation

@misc{wang2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Haotian Wang and Zhengyuan Yang and Yue Zhao and Bin Lin and Zhe Chen and Yue Cao and Hongxia Yang},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.01328v1},
}
