
VideoChat-Flash-Qwen2_5-7B-1M (res224)

Developed by: OpenGVLab
VideoChat-Flash is a multimodal model built upon UMT-L and Qwen2.5-7B-1M, supporting long video understanding with a context window extended to 1M tokens.
Downloads: 64
Release date: 2/19/2025

Model Overview

This model focuses on multimodal interaction between video and text and can process video inputs of up to approximately 50,000 frames, making it suitable for video understanding and analysis tasks.

Model Features

Efficient Long Video Processing
Extends the context window to 1M tokens via YaRN, supporting video inputs of up to approximately 50,000 frames.
Low Token Consumption
Uses only 16 tokens per frame for efficient video content understanding (see the token-budget sketch after this list).
Multimodal Capability
Combines visual and language comprehension for video-text interaction.
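The two headline numbers above are consistent with each other: at 16 tokens per frame, a 1M-token context leaves room for roughly 50,000 frames once some budget is reserved for the text prompt and response. A minimal sketch of that arithmetic follows; the reserved-token figure is an illustrative assumption, not a documented value.

```python
# Rough frame-budget arithmetic for VideoChat-Flash-Qwen2_5-7B-1M (res224).
# The reserved-token figure below is an illustrative assumption.

CONTEXT_WINDOW = 1_000_000   # 1M-token context extended via YaRN
TOKENS_PER_FRAME = 16        # visual tokens used per sampled frame
RESERVED_FOR_TEXT = 200_000  # assumed budget for system prompt, question, and answer

max_frames = (CONTEXT_WINDOW - RESERVED_FOR_TEXT) // TOKENS_PER_FRAME
print(f"Approximate frame budget: {max_frames:,} frames")  # -> 50,000 frames
```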

Model Capabilities

Video Content Understanding
Multimodal Interaction
Long Video Processing
Text Generation

Use Cases

Video Analysis
Video Question Answering: answer questions based on video content (see the inference sketch after this section); achieves 74.1% accuracy on the MLVU benchmark.
Video Content Understanding: understand and describe long video content; achieves 66.5% accuracy on LongVideoBench.
Multimodal Testing
Perception Testing: evaluation of multimodal perception capabilities; achieves 75.4% accuracy on the Perception Test benchmark.
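For orientation, here is how video question answering might look with Hugging Face transformers. This is a minimal sketch only: the repository id, the chat() method, and its parameters (video_path, question, max_num_frames) are assumptions modeled on similar OpenGVLab releases, not confirmed API for this model; consult the official model card for the exact interface exposed via remote code.

```python
# Hypothetical inference sketch for video question answering with VideoChat-Flash.
# The repo id and the chat() signature are assumptions; check the model card
# for the actual interface provided by the remote modeling code.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,      # loads the VideoChat-Flash modeling code from the repo
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assumed chat-style call: sample frames from a long video and ask a question.
answer = model.chat(
    tokenizer=tokenizer,
    video_path="long_video.mp4",              # hypothetical input video
    question="What happens in the final scene?",
    max_num_frames=512,                       # frame sampling budget; 16 tokens per frame
)
print(answer)
```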