VideoChat-R1-thinking_7B: Open-Source Multimodal Model for Video-to-Text Tasks

Videochat R1 Thinking 7B

Developed by OpenGVLab

VideoChat-R1-thinking_7B is a multimodal model based on Qwen2.5-VL-7B-Instruct, focusing on video-text-to-text tasks.

Video-to-Text

Transformers

Language: English. Open Source License: Apache-2.0. Tags: #Video Text Understanding #Multimodal Interaction #7B Parameter Scale

Downloads 800

Release Time: 4/13/2025

Model Overview

This model combines visual and language processing capabilities to understand and generate text descriptions related to video content.

Model Features

Multimodal Processing

Capable of processing both video and text information, enabling cross-modal understanding and generation.

High Accuracy

Demonstrates high accuracy in video-text-to-text tasks.

Instruction Following

Supports instruction-based interaction and can generate relevant text based on user instructions.

Model Capabilities

Video Content Understanding

Text Generation

Multimodal Reasoning

Use Cases

Video Content Analysis

Video Summarization

Generate concise text summaries based on video content.

Produces accurate and coherent video summaries.

Video Question Answering

Answer specific questions about video content.

Provides accurate answers related to the video content.

Education

Educational Video Assistance

Generate auxiliary text or subtitles for educational videos.

Enhances the accessibility and comprehensibility of educational videos.

🚀 VideoChat-R1-thinking_7B

VideoChat-R1-thinking_7B is a multimodal model for video-text processing, enabling video-text-to-text tasks with high accuracy.

🚀 Quick Start

📦 Installation

We provide a simple installation example below:

pip install transformers
pip install qwen_vl_utils
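
The usage example below requests `attn_implementation="flash_attention_2"`, which depends on the separate `flash_attn` package that the commands above do not install. As a minimal sketch (the helper name `pick_attn_implementation` is hypothetical and not part of the model card), one way to fall back to PyTorch's built-in `"sdpa"` attention when `flash_attn` is not importable:

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Hypothetical helper: choose an attention backend name for from_pretrained.

    Returns "flash_attention_2" only if the flash_attn package can be found,
    otherwise "sdpa" (PyTorch scaled-dot-product attention).
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(f"Using attention implementation: {attn_impl}")
```

The returned string can be passed as `attn_implementation=attn_impl` when loading the model. This only checks that the package is importable; it does not verify that the GPU and installed build actually support FlashAttention.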

💻 Usage Examples

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1-thinking_7B"
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"""{question}
             
             Output your thought process within the <think> </think> tags, including analysis with either specific timestamps (xx.xx) or time ranges (xx.xx to xx.xx) in <timestep> </timestep> tags.

            Then, provide your final answer within the <answer> </answer> tags.
             """},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also passed to the model so that frames can be aligned with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
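
The prompt above asks the model to wrap its reasoning in `<think>`, its time references in `<timestep>`, and its final answer in `<answer>` tags. The following is a minimal sketch of how such a response could be split into those parts with regular expressions. The function name `parse_response` and the returned dictionary layout are assumptions made for illustration, and the parser assumes the model actually follows the requested tag format, which is not guaranteed.

```python
import re
from typing import Dict, List, Optional, Tuple


def _first_tag(name: str, text: str) -> Optional[str]:
    """Return the stripped body of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> Dict[str, object]:
    """Hypothetical parser for the tag format requested in the prompt."""
    spans: List[Tuple[float, float]] = []
    for body in re.findall(r"<timestep>(.*?)</timestep>", text, flags=re.DOTALL):
        numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", body)]
        if len(numbers) == 1:
            # A single timestamp is stored as a zero-length range.
            spans.append((numbers[0], numbers[0]))
        elif len(numbers) >= 2:
            # A time range: keep the first two numbers as (start, end).
            spans.append((numbers[0], numbers[1]))
    return {
        "think": _first_tag("think", text),
        "timesteps": spans,
        "answer": _first_tag("answer", text),
    }
```

Since `output_text` in the example above is a list with one string per input, `parse_response(output_text[0])` would be the natural call. Missing tags yield `None` (or an empty list for timesteps) rather than raising an error.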

📄 License

This project is licensed under the Apache-2.0 license.

✏️ Citation

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}

📋 Information Table

Property	Details
Library Name	transformers
Model Type	Video-text-to-text
Base Model	Qwen/Qwen2.5-VL-7B-Instruct
Metrics	accuracy
Tags	multimodal
License	apache-2.0
