Videochat Flash Qwen2 7B Res448

Developed by OpenGVLab
VideoChat-Flash-7B is a multimodal model built upon UMT-L (300M) and Qwen2-7B, using only 16 tokens per frame and supporting input sequences of up to approximately 10,000 frames.
Downloads 661
Release Time: 1/11/2025

Model Overview

This is a multimodal video-text model for interactive tasks that combine video and text. It offers efficient video understanding and text generation.

Model Features

Efficient video processing
Represents each frame with only 16 tokens, which keeps the total token count low even for long videos.
Long sequence support
Extends the context window to 128k tokens via YaRN, supporting input sequences of up to approximately 10,000 frames.
Multimodal capability
Combines video and text processing, making it suitable for complex multimodal tasks.
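
The relationship between tokens per frame and context length can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, "128k" is read as either 128 * 1024 or 128,000 tokens, and the optional allowance for text tokens is an assumption. A strict division gives roughly 8,000 frames, the same order of magnitude as the "approximately 10,000 frames" quoted above; this page does not say how that figure was derived.

```python
def max_frames(context_tokens: int,
               tokens_per_frame: int = 16,
               reserved_text_tokens: int = 0) -> int:
    """Upper bound on the number of frames that fit in the context window.

    reserved_text_tokens is a hypothetical allowance for the prompt and
    the generated answer; it is not a value taken from the model card.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = max(context_tokens - reserved_text_tokens, 0)
    return available // tokens_per_frame


def tokens_for_frames(num_frames: int, tokens_per_frame: int = 16) -> int:
    """Number of visual tokens produced by num_frames frames."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    # Frame budget if "128k" means 128 * 1024 tokens.
    print(max_frames(128 * 1024))
    # Frame budget if "128k" means 128,000 tokens, with 512 reserved for text.
    print(max_frames(128_000, reserved_text_tokens=512))
    # Visual tokens needed for 10,000 frames at 16 tokens each.
    print(tokens_for_frames(10_000))
```

At 16 tokens per frame, each additional 1,000 frames costs 16,000 tokens, so the per-frame token count is the main lever on how long a video fits in a fixed window.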

Model Capabilities

Video understanding
Text generation
Multimodal interaction

Use Cases

Video analysis
Video QA
Answer questions based on video content.
Achieves 74.7% accuracy on the MLVU dataset.
Video summarization
Generate textual summaries of video content.
Multimodal evaluation
Multimodal benchmark testing
Evaluate multimodal performance on benchmarks such as MVBench.
Achieves 74.0% accuracy on MVBench.