
VideoChat-Flash-Qwen2-7B_res224

Developed by OpenGVLab
A multimodal model built on the UMT-L vision encoder and the Qwen2-7B language model, supporting long-video understanding with only 16 tokens per frame and a context window extended to 128k.
Downloads: 80
Release date: 1/11/2025

Model Overview

VideoChat-Flash-7B is an efficient multimodal model for video-to-text tasks, capable of processing input sequences of up to approximately 10,000 frames.
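To make the frame counts above concrete, here is a minimal sketch of uniform frame sampling, a common preprocessing step when feeding a long video to a model like this. The helper name and the sampling strategy are illustrative assumptions, not part of the model's released pipeline.

```python
# Illustrative sketch: uniformly sample frame indices from a long video.
# `sample_frame_indices` is a hypothetical helper, not part of any released API.

def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a video."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

# e.g. pick 8 representative frames from a 3,000-frame clip
indices = sample_frame_indices(3000, 8)
```

The sampled indices are evenly spread across the clip, so even very long videos can be reduced to a fixed frame budget before tokenization.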

Model Features

Efficient Video Processing
Uses only 16 tokens per frame, significantly reducing computational resource requirements.
Long Video Support
Extends the context window to 128k via YaRN, supporting input sequences of up to approximately 10,000 frames.
Multimodal Understanding
Combines vision and language models for deep understanding of video content.
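The two figures above, 16 visual tokens per frame and a 128k context window, imply a simple token budget. The sketch below does the back-of-the-envelope arithmetic; the constant names are illustrative and not part of any released API.

```python
# Back-of-the-envelope token budget using the figures stated above:
# 16 visual tokens per frame and a 128k (131,072-token) context window.

TOKENS_PER_FRAME = 16
CONTEXT_WINDOW = 128 * 1024  # 131,072 tokens

def visual_tokens(num_frames: int) -> int:
    """Visual-token cost of encoding `num_frames` frames."""
    return num_frames * TOKENS_PER_FRAME

# A 4,096-frame video costs 65,536 visual tokens,
# leaving roughly half the window for text prompt and generation.
cost = visual_tokens(4096)
fits = cost < CONTEXT_WINDOW
```

At 16 tokens per frame, thousands of frames fit comfortably alongside the text prompt, which is what makes the long-video inputs described above practical.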

Model Capabilities

Video Content Understanding
Multimodal Reasoning
Long Video Processing
Text Generation

Use Cases

Video Analysis
Video QA
Answer questions based on video content.
Achieves 74.5% accuracy on the MLVU dataset.
Video Content Summarization
Generate text summaries of video content.
Achieves 64.2% accuracy on the LongVideoBench dataset.
Multimodal Reasoning
Visual QA
Perform reasoning by combining video and text information.
Achieves 75.6% accuracy on the Perception Test dataset.