
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B

Developed by OpenGVLab
A multimodal video-text model built upon InternVideo2-1B and Qwen2.5-7B, using only 16 tokens per frame and supporting input sequences of up to 10,000 frames.
Released: February 19, 2025

Model Overview

An efficient multimodal video-text model focused on video understanding and text generation tasks, and particularly well suited to analyzing long video content.

Model Features

Efficient video processing
Uses only 16 tokens per frame, significantly reducing computational resource requirements
Ultra-long context support
Context window extended to 128k via YaRN, supporting approximately 10,000 input frames (see the budget sketch after this list)
Multimodal understanding
Combines vision and language models for in-depth understanding of video content
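As a rough illustration of the figures above, the sketch below checks how many visual tokens a given number of frames consumes at 16 tokens per frame against the YaRN-extended 128k window. This is my own back-of-the-envelope helper, not part of the released code; the exact window size (assumed here to be 128 × 1024 tokens) and any overhead for text prompts or special tokens are not specified by the card.

```python
# Back-of-the-envelope budget check for the figures quoted above.
# Assumptions (not from the model card): the context window is exactly
# 128 * 1024 tokens and no overhead is counted for text or special tokens.

TOKENS_PER_FRAME = 16          # stated by the model card
CONTEXT_WINDOW = 128 * 1024    # "128k" after YaRN extension (assumed exact size)

def visual_token_count(num_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Tokens consumed by the video portion of the input."""
    return num_frames * tokens_per_frame

if __name__ == "__main__":
    for frames in (1_000, 4_000, 8_000):
        used = visual_token_count(frames)
        print(f"{frames:>5} frames -> {used:>7} visual tokens "
              f"({CONTEXT_WINDOW - used} tokens left for text)")
```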

Model Capabilities

Video content understanding
Long video analysis (see the frame-sampling sketch below)
Multimodal reasoning
Video Q&A
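To exercise the long-video analysis capability, an hours-long recording is typically downsampled to a bounded, evenly spaced set of frames before being handed to the model. The OpenCV-based helper below is my own illustration of that preprocessing step, not code from the VideoChat-Flash release; the video path and frame cap are placeholders.

```python
# Uniformly sample at most `max_frames` frames from a (possibly hours-long) video.
# Illustrative preprocessing only; not taken from the VideoChat-Flash codebase.
import cv2
import numpy as np

def sample_frames(video_path: str, max_frames: int = 512) -> list[np.ndarray]:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole video.
    indices = np.linspace(0, max(total - 1, 0), num=min(max_frames, total), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# Example (hypothetical path):
# frames = sample_frames("lecture_recording.mp4", max_frames=1024)
```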

Use Cases

Video content analysis
Long video summarization
Extracts key information and summarizes hours-long video content
Achieved 64.5% accuracy on a long-video benchmark
Video Q&A
Answers complex questions about video content (a loading sketch follows this section)
Achieved 73.4% accuracy on the MLVU benchmark
Multimodal understanding
Video scene understanding
Identifies and analyzes scenes, actions, and objects in videos
Achieved 76.3% accuracy on the Perception Test benchmark
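For completeness, a minimal loading sketch follows. It assumes the model is published on Hugging Face under a repository id matching this card's title and that the repository ships its own modeling and processing code (hence trust_remote_code=True); the actual video question-answering interface is defined by that remote code and is not reproduced here.

```python
# Minimal loading sketch (assumptions: the repo id below is inferred from this
# card's title, and the repository provides its own modeling/processing code).
from transformers import AutoModel, AutoTokenizer

repo_id = "OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,   # the video branch is implemented in the repo's custom code
    torch_dtype="auto",       # load weights in their native precision
    device_map="auto",        # place the 7B model across available GPUs/CPU
)

# The video question-answering entry point (e.g. a chat/generate helper) is part
# of the repository's custom code; consult the model card for the exact call.
```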