
VideoChat-Flash-Qwen2_5-2B_res448

Developed by OpenGVLab
VideoChat-Flash-2B is a multimodal model built upon UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and extending the context window to 128k.
Downloads: 904
Release Time: 1/11/2025

Model Overview

This model specializes in multimodal tasks, particularly video-to-text conversion, and can process long video inputs of up to approximately 10,000 frames.

Model Features

Efficient video processing
Uses only 16 tokens per frame, significantly reducing computational requirements.
Long video support
Extends the context window to 128k via YaRN, supporting input sequences of up to approximately 10,000 frames.
Multimodal capability
Combines vision and language models for efficient conversion between video and text.
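The per-frame token budget and the 128k context window together bound how many frames fit into a single input. A back-of-envelope sketch in Python (the interpretation of "128k" as 128 × 1024 tokens and the number of tokens reserved for the text prompt are assumptions, not figures from the model card):

```python
# Rough frame-capacity estimate for VideoChat-Flash-2B.
TOKENS_PER_FRAME = 16          # stated by the model card
CONTEXT_WINDOW = 128 * 1024    # assumed: "128k" = 131,072 tokens
RESERVED_FOR_TEXT = 1_000      # hypothetical budget for prompt/response text

def max_frames(context=CONTEXT_WINDOW,
               reserved=RESERVED_FOR_TEXT,
               per_frame=TOKENS_PER_FRAME):
    """Number of whole frames that fit in the remaining context."""
    return (context - reserved) // per_frame

print(max_frames())  # 8129 under these assumptions
```

Under these assumptions roughly 8,000 frames fit in one context window, the same order of magnitude as the card's "approximately 10,000 frames"; the exact figure depends on how much of the window the model reserves for text tokens.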

Model Capabilities

Video-to-text conversion
Multimodal understanding
Long video processing

Use Cases

Video analysis
Video content understanding: analyze video content and generate textual descriptions. Achieves 65.7% accuracy on the MLVU benchmark.
Long video processing: process long videos and extract key information. Achieves 58.3% accuracy on a long-video benchmark.
Multimodal testing
Perception testing: conduct multimodal perception capability tests. Achieves 70.5% accuracy.
© 2025 AIbase