VideoChat-Flash-Qwen2-7B_res448 Open-Source Multimodal Model - Ultra-Efficient in Long-Frame Video Input Processing


Developed by OpenGVLab

VideoChat-Flash-7B is a multimodal model built upon UMT-L (300M) and Qwen2-7B, using only 16 tokens per frame and supporting input sequences of up to approximately 10,000 frames.

Video-to-Text

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Long-video understanding #Low-token multimodal #128k context window


Release Time: 1/11/2025

Model Overview

This is a video-to-text multimodal model: it takes a video and a text prompt as input and generates text, with a focus on efficient understanding of long videos.

Model Features

Efficient video processing

Represents each frame with only 16 tokens, which keeps the visual token count low even for long videos.

Long sequence support

Extends the context window to 128k via YaRN, supporting input sequences of up to approximately 10,000 frames.

Multimodal capability

Combines video and text processing abilities, suitable for complex multimodal tasks.

Model Capabilities

Video understanding

Text generation

Multimodal interaction

Use Cases

Video analysis

Video QA

Answer questions based on video content.

Achieves 74.7% accuracy on the MLVU dataset.

Video summarization

Generate textual summaries of video content.

Multimodal evaluation

Multimodal benchmark testing

Conduct multimodal performance evaluations on datasets like MVBench.

Achieves 74.0% accuracy on MVBench.

🚀 VideoChat-Flash-Qwen2-7B_res448

VideoChat-Flash-Qwen2-7B_res448 is a multimodal model built on UMT-L (300M) and Qwen2-7B. It uses only 16 tokens per frame and can handle input sequences of up to about 10,000 frames by extending the context window to 128k.
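To make the token budget concrete, the short sketch below works through the arithmetic implied by "16 tokens per frame" and a "128k" context window. It assumes that 128k means 128 × 1024 = 131,072 tokens and that every frame costs exactly 16 tokens; the function names are illustrative and not part of the model's API, and the model's actual accounting may differ.

```python
# Back-of-the-envelope token budget for long-video input.
# Assumptions (not from the model's code): "128k" = 128 * 1024 tokens,
# and each frame costs exactly TOKENS_PER_FRAME tokens.

TOKENS_PER_FRAME = 16
CONTEXT_128K = 128 * 1024  # 131,072 tokens


def visual_tokens(num_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens produced by num_frames frames."""
    return num_frames * tokens_per_frame


def max_frames(context_tokens: int,
               tokens_per_frame: int = TOKENS_PER_FRAME,
               text_tokens: int = 0) -> int:
    """Whole frames that fit in the context after reserving text_tokens."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return max(0, (context_tokens - text_tokens) // tokens_per_frame)


print(visual_tokens(512))        # 8192 tokens for the 512-frame setting
print(visual_tokens(10_000))     # 160000 tokens for 10,000 frames
print(max_frames(CONTEXT_128K))  # 8192 frames fit in 131,072 tokens
```

Under these assumptions, a 131,072-token window holds about 8,192 frames, so the "approximately 10,000 frames" figure is best read as an order-of-magnitude statement or as depending on details not given here.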

🚀 Quick Start

Prerequisites

First, install the required modules. Flash Attention 2 (flash-attn) is listed as optional. We provide a simple installation example below:

pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional 
pip install flash-attn --no-build-isolation
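After installation, a quick check like the following can confirm which modules are importable. This is a minimal sketch, not part of the official instructions: the import names (for example `cv2` for opencv-python and `flash_attn` for flash-attn) are the usual ones for these packages, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names corresponding to the pip packages listed above.
REQUIRED = ["transformers", "av", "imageio", "decord", "cv2"]
OPTIONAL = ["flash_attn"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_required = missing_modules(REQUIRED)
missing_optional = missing_modules(OPTIONAL)

if missing_required:
    print("Missing required modules:", ", ".join(missing_required))
else:
    print("All required modules are importable.")

if missing_optional:
    print("Optional modules not installed:", ", ".join(missing_optional))
```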

Usage Examples

Basic Usage

from transformers import AutoModel, AutoTokenizer
import torch

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2-7B_res448'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # whether to use global compression
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)
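The `max_num_frames` setting caps how many frames are passed to the model. How `model.chat` selects frames internally is not described here; the sketch below shows one common strategy, uniform sampling at segment midpoints, purely as an illustration. The function `sample_frame_indices` is hypothetical and not part of the model's API.

```python
from typing import List


def sample_frame_indices(total_frames: int, max_num_frames: int) -> List[int]:
    """Pick at most max_num_frames indices spread evenly over a video.

    The video is split into max_num_frames equal segments and the frame
    at the middle of each segment is taken. Illustrative only: the model's
    own sampling logic may differ.
    """
    if total_frames <= 0 or max_num_frames <= 0:
        return []
    if total_frames <= max_num_frames:
        # Short video: keep every frame.
        return list(range(total_frames))
    step = total_frames / max_num_frames
    return [int(step * i + step / 2) for i in range(max_num_frames)]


print(sample_frame_indices(10, 4))  # [1, 3, 6, 8]
print(sample_frame_indices(3, 512))  # [0, 1, 2]
```

With a cap of 512 frames, even an hour-long video is reduced to 512 evenly spaced frames, which at 16 tokens per frame is 8,192 visual tokens.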

✨ Features

Efficient Token Usage: Employing only 16 tokens per frame, it optimizes computational resources.
Extended Context Window: Supports input sequences of up to approximately 10,000 frames by extending the context window to 128k.

📈 Performance

Model	MVBench	LongVideoBench	VideoMME (w/o subtitles)	Max Input Frames
VideoChat-Flash-Qwen2_5-2B@448	70.0	58.3	57.0	10000
VideoChat-Flash-Qwen2-7B@224	73.2	64.2	64.0	10000
VideoChat-Flash-Qwen2_5-7B-1M@224	73.4	66.5	63.5	50000
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224	74.3	64.5	65.1	10000
VideoChat-Flash-Qwen2-7B@448	74.0	64.7	65.3	10000

📚 Documentation

[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

⚠️ Important Note

Due to a predominantly English training corpus, the model only exhibits basic Chinese comprehension. To ensure optimal performance, using English for interaction is recommended.

✏️ Citation

@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}

📄 License

This project is licensed under the Apache-2.0 license.
