VideoChat-Flash-Qwen2_5-2B_res448 Open-Source Multimodal Model - Achieving Efficient Video-to-Text Conversion

Videochat Flash Qwen2 5 2B Res448

Developed by OpenGVLab

VideoChat-Flash-2B is a multimodal model built upon UMT-L (300M) and Qwen2.5-1.5B, supporting video-to-text tasks with only 16 tokens per frame and extending the context window to 128k.

Video-to-Text

Transformers

EnglishOpen Source License:Apache-2.0 #Ultra-long video understanding #Low token consumption #Multimodal Q&A

Downloads 904

Release Time : 1/11/2025

Model Overview

This model specializes in multimodal tasks, particularly video-to-text conversion, capable of processing long video inputs (up to approximately 10,000 frames).

Model Features

Efficient video processing

Uses only 16 tokens per frame, significantly reducing computational resource requirements.

Long video support

Extends the context window to 128k via Yarn, supporting input sequences of up to approximately 10,000 frames.

Multimodal capability

Combines vision and language models for efficient conversion between video and text.

Model Capabilities

Video-to-text conversion

Multimodal understanding

Long video processing

Use Cases

Video analysis

Video content understanding

Analyze video content and generate textual descriptions.

Achieves 65.7% accuracy on the MLVU dataset.

Long video processing

Process long videos and extract key information.

Achieves 58.3% accuracy on the long video benchmark.

Multimodal testing

Perception testing

Conduct multimodal perception capability tests.

Achieves 70.5% accuracy on perception tests.

🚀 VideoChat-Flash-Qwen2_5-2B_res448⚡

VideoChat-Flash-2B is a model built on UMT-L (300M) and Qwen2.5-1.5B, using only 16 tokens per frame. By extending the context window to 128k with Yarn (Qwen2's native context window is 32k), it can support input sequences of up to about 10,000 frames. This model is suitable for multimodal tasks, especially video - text - to - text processing.

⚠️ Important Note

Due to a predominantly English training corpus, the model only exhibits basic Chinese comprehension. To ensure optimal performance, using English for interaction is recommended.

🚀 Quick Start

Installation

First, you need to install flash attention2 and some other modules. We provide a simple installation example below:

pip install transformers==4.40.1
pip install timm
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation

Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer
import torch

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False # use the global compress or not
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)

✨ Features

Efficient Token Utilization: Uses only 16 tokens per frame, optimizing resource usage.
Extended Context Window: Supports input sequences of up to about 10,000 frames by extending the context window to 128k.

📈 Performance

Model	MVBench	LongVideoBench	VideoMME(w/o sub)	Max Input Frames
VideoChat-Flash-Qwen2_5-2B@448	70.0	58.3	57.0	10000
VideoChat-Flash-Qwen2-7B@224	73.2	64.2	64.0	10000
VideoChat-Flash-Qwen2_5-7B-1M@224	73.4	66.5	63.5	50000
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224	74.3	64.5	65.1	10000
VideoChat-Flash-Qwen2-7B@448	74.0	64.7	65.3	10000

📚 Documentation

📄 License

This project is licensed under the Apache-2.0 license.

✏️ Citation

@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}

📋 Model Details

Property	Details
Model Type	VideoChat-Flash-Qwen2_5-2B_res448
Training Data	Not provided
Metrics	Accuracy on MVBench, LongVideoBench, VideoMME (w/o sub), etc.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご