VideoChat-R1_7B Open-Source Multimodal Video Understanding Model - Supports Video and Text Input to Generate Text Output

VideoChat-R1_7B

Developed by OpenGVLab

VideoChat-R1_7B is a multimodal video understanding model based on Qwen2.5-VL-7B-Instruct, capable of processing video and text inputs and generating text outputs.

Video-to-Text

Transformers

English

Open Source License: Apache-2.0

#Video Q&A #Multimodal understanding #7B parameter scale

Downloads 1,686

Release Time: 4/13/2025

Model Overview

This model targets video-text-to-text tasks: it interprets video content and answers questions about it, which makes it suitable for video content analysis and interactive Q&A scenarios.

Model Features

Multimodal video understanding

Capable of simultaneously processing video and text inputs, understanding video content and generating relevant text outputs.

Efficient video processing

Processes video at up to 460,800 pixels and 32 frames (the max_pixels and nframes settings used in the Quick Start example), balancing computational cost against video understanding quality.

Structured output

When prompted to do so, the model places its final answer inside <answer> </answer> tags, which makes the output easy to parse in downstream processing.
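To make the 460,800-pixel and 32-frame figures above more concrete, here is a minimal sketch of how a frame could be scaled to fit a pixel budget and how 32 frame indices could be picked evenly across a clip. It is an illustration only, not the preprocessing code the model actually runs: the helper names fit_to_pixel_budget and uniform_frame_indices are hypothetical, the budget is assumed to apply per frame, and rounding sizes down to a multiple of 28 is an assumption as well.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=460_800, factor=28):
    """Scale (width, height) down so that width * height <= max_pixels.

    Hypothetical helper: both sides are rounded down to a multiple of
    `factor` (an assumed patch granularity), so the aspect ratio may
    shift slightly. Frames already under the budget are never upscaled.
    """
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    new_w = max(factor, math.floor(width * scale / factor) * factor)
    new_h = max(factor, math.floor(height * scale / factor) * factor)
    return new_w, new_h


def uniform_frame_indices(total_frames, nframes=32):
    """Pick up to `nframes` evenly spaced frame indices from a clip."""
    if total_frames < 1:
        raise ValueError("total_frames must be at least 1")
    n = min(nframes, total_frames)
    if n == 1:
        return [0]
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


print(fit_to_pixel_budget(1920, 1080))  # (896, 504)
print(len(uniform_frame_indices(300)))  # 32
```

Under these assumptions a 1080p frame shrinks to roughly 0.45 megapixels, which is why long or high-resolution videos stay tractable at this setting.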

Model Capabilities

Video content understanding

Video Q&A

Multimodal reasoning

Structured text generation

Use Cases

Video content analysis

Video Q&A system

Users upload videos and ask questions, and the model analyzes the video content and answers the questions.

Accurately understand video content and provide relevant answers.

Video content summarization

Automatically generate text summaries of video content.

Generate concise and accurate descriptions of video content.

Intelligent interaction

Educational assistance

After students watch teaching videos, they can deepen their understanding through Q&A.

Provide accurate explanations and answers for teaching content.

🚀 VideoChat-R1_7B

VideoChat-R1_7B is a multimodal model based on the Qwen2.5-VL-7B-Instruct base model, designed for video-text-to-text tasks, offering high-accuracy performance.

[📂 GitHub]
[📜 Tech Report]

🚀 Quick Start

First, install the required packages:

pip install transformers
pip install qwen_vl_utils

Then you can load the model and run inference:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B"
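# Default: load the model on the available device(s).
# attn_implementation="flash_attention_2" requires the flash-attn package;
# remove that argument to use the default attention implementation.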
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 460800,
                "nframes": 32
            },
            {"type": "text", "text": f"""{question}
            Provide your final answer within the <answer> </answer> tags.
             """},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also passed to the model so it can align frames with absolute time.
# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
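Because the prompt asks for the final answer inside <answer> </answer> tags, the generated text can be post-processed with a small parser. The following is a minimal sketch using only the Python standard library; the helper name extract_answer and the sample string are hypothetical and are not part of the model's API or real model output.

```python
import re

# Non-greedy match so only the first <answer>...</answer> pair is captured;
# DOTALL lets the answer span multiple lines.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(generated_text):
    """Return the text inside the first <answer> pair, or None if absent."""
    match = ANSWER_PATTERN.search(generated_text)
    return match.group(1).strip() if match else None


# Hypothetical model output, for illustration only.
sample = "The cups are shuffled twice. <answer>The rightmost cup</answer>"
print(extract_answer(sample))  # The rightmost cup
```

With the code above, this could be applied as extract_answer(output_text[0]), since batch_decode returns a list with one string per input. Returning None when the tags are missing lets the caller decide whether to fall back to the raw text.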

📄 License

This project is licensed under the Apache 2.0 license.

✏️ Citation

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}

📚 Documentation

Property	Details
Library Name	transformers
Metrics	accuracy
Tags	multimodal
Pipeline Tag	video-text-to-text
Base Model	Qwen/Qwen2.5-VL-7B-Instruct
License	apache-2.0
