🚀 VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B⚡
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B is a multimodal model built on InternVideo2-1B and Qwen2.5-7B. It uses only 16 tokens per frame and supports input sequences of up to roughly 10,000 frames by extending the context window to 128k with YaRN.
[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]
⚠️ Important Note
Because the training corpus is predominantly English, the model has only basic Chinese comprehension. For optimal performance, interacting in English is recommended.
🚀 Quick Start
First, install the required dependencies; FlashAttention-2 is optional but recommended. A simple installation example is provided below:
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
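If you installed the optional flash-attn package, a quick sanity check (a minimal sketch, not part of the official instructions) is to confirm that a CUDA device is visible and that the package imports cleanly:
import torch

# The model is moved to the GPU with .cuda() below, so a CUDA device is required.
print("CUDA available:", torch.cuda.is_available())

# flash-attn is optional; this only verifies the install, nothing model-specific.
try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")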
Then you can use the model:
Basic Usage
from transformers import AutoModel, AutoTokenizer
import torch
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor
# Optional: compress visual tokens inside the LLM at the listed layers
mm_llm_compress = False
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)
video_path = "your_video.mp4"
# First turn
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output1)
# Follow-up turn: pass chat_history back in for multi-turn dialogue
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)
print(output2)
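If flash-attn is installed, you can also try loading the model with the FlashAttention-2 backend via the standard transformers loading arguments. This is a hedged sketch using the generic attn_implementation parameter; whether the model's custom remote code honors it may vary.
import torch
from transformers import AutoModel

# Sketch only: standard from_pretrained arguments; the custom remote code
# decides how (or whether) attn_implementation is applied.
model = AutoModel.from_pretrained(
    'OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the optional flash-attn package
).cuda()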
✨ Features
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B is constructed upon InternVideo2-1B and Qwen2.5-7B, employing only 16 tokens per frame. By leveraging YaRN to extend the context window to 128k (Qwen2.5's native context window is 32k), the model supports input sequences of up to approximately 10,000 frames.
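As a rough, illustrative back-of-envelope (not a figure from the model card): at 16 tokens per frame, a 128k-token context alone accommodates about 8,000 frames' worth of visual tokens, the same order of magnitude as the quoted ~10,000-frame limit.
# Illustrative arithmetic only; real budgets also depend on text tokens,
# frame sampling, and any in-LLM compression settings chosen.
tokens_per_frame = 16
context_window = 128_000  # 32k native, extended to 128k with YaRN

print(context_window // tokens_per_frame)  # 8000 frames fit in the raw context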
📈 Performance
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B |
| Training Data | Not provided |
Evaluation Results
| Task Type | Dataset Name | Accuracy |
|-----------|--------------|----------|
| Multimodal | MLVU | 73.4 |
| Multimodal | MVBench | 74.3 |
| Multimodal | Perception Test | 76.3 |
| Multimodal | LongVideoBench | 64.5 |
| Multimodal | VideoMME (wo sub) | 65.2 |
| Multimodal | LVBench | 48.7 |
✏️ Citation
@article{li2024videochatflash,
title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
journal={arXiv preprint arXiv:2501.00574},
year={2024}
}
📄 License
This project is licensed under the Apache-2.0 license.