Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4

Developed by shi-labs
A video multimodal large language model (MLLM) with a slow-fast architecture that balances temporal resolution against spatial detail and supports 64-frame video understanding
Downloads 184
Release Time: 3/19/2025

Model Overview

This model applies a slow-fast dual-token strategy to video input. It combines the Qwen2-7B language model with a ConvNeXt-576 visual encoder to understand video efficiently within a limited computational budget.

Model Features

Slow-Fast Dual-Token Strategy
Fast tokens give a quick overview of the video content, while slow tokens carry the fine visual details, so the video can be understood efficiently
High Frame Rate Processing
Supports 64-frame video input, giving finer temporal resolution than methods that sample fewer frames per video
Linear Complexity Cross-Attention
Custom hybrid decoding layers enable linear-complexity cross-attention between text and raw video features
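
The features above can be illustrated with a small NumPy sketch. It is not the model's implementation: the pooling used to build fast tokens, the absence of learned projections, and every name and size in it (pool_fast_tokens, cross_attention, the 8x8 token grid) are assumptions made for illustration. It only shows why letting text queries cross-attend to raw video features costs an amount linear in the number of video tokens, whereas self-attention over the concatenated sequence grows quadratically.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_fast_tokens(frames, stride):
    """Hypothetical fast-token builder: average-pool each frame's token grid.

    frames has shape (num_frames, height, width, dim); height and width
    must be divisible by stride. Returns (num_fast_tokens, dim).
    """
    n, h, w, d = frames.shape
    blocks = frames.reshape(n, h // stride, stride, w // stride, stride, d)
    return blocks.mean(axis=(2, 4)).reshape(-1, d)


def cross_attention(text, video):
    """Text queries attend to video keys/values (no learned projections)."""
    d = text.shape[-1]
    scores = text @ video.T / np.sqrt(d)  # shape (num_text, num_video)
    return softmax(scores, axis=-1) @ video  # shape (num_text, dim)


def cross_attention_scores(num_text, num_video):
    """Score-matrix entries when only text attends to video."""
    return num_text * num_video


def self_attention_scores(num_text, num_video):
    """Score-matrix entries for self-attention over text and video together."""
    return (num_text + num_video) ** 2
```

With 64 frames of 8x8 tokens, the uncompressed (slow) features form 4096 tokens, and pool_fast_tokens with a stride of 4 reduces them to 256 fast tokens. Doubling the number of video tokens doubles cross_attention_scores but roughly quadruples self_attention_scores.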

Model Capabilities

Video content understanding
Video content description generation
Multimodal reasoning
Long video processing
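
A long video contains far more frames than the 64 the model accepts, so some frames have to be selected before inference. The sketch below shows one common choice, uniform sampling at the centre of equal-length segments. This is an assumption for illustration, not the sampling scheme documented for this model, and the function name uniform_frame_indices is hypothetical.

```python
def uniform_frame_indices(total_frames, num_samples=64):
    """Pick frame indices spread evenly across a video.

    Splits the video into num_samples equal segments and takes the frame
    at the centre of each. Videos with no more frames than num_samples
    are returned in full.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]
```

The returned indices can be passed to whichever video decoder is in use to extract the frames that will be fed to the visual encoder.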

Use Cases

Video Content Analysis
Video Content Description
Generates detailed descriptions of input videos
Outperforms pure self-attention baselines on video understanding benchmarks
Intelligent Surveillance
Surveillance Video Analysis
Analyzes key events in surveillance videos