VideoChat-R1_7B: Open-Source Multimodal Video Understanding Model (Video and Text Input, Text Output)

Videochat R1 7B

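Developed by OpenGVLab

VideoChat-R1_7B is a multimodal video understanding model built on Qwen2.5-VL-7B-Instruct. It takes video and text as input and generates text as output.

Video-Text-to-Text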

Transformers

Language: English  License: Apache-2.0  #video-question-answering #multimodal-understanding #7B-parameters

Downloads: 1,686

Released: 4/13/2025

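Model Overview

The model focuses on video-text-to-text tasks: it interprets video content and answers questions about it, which makes it suitable for video content analysis and interactive question answering.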

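Model Features

Multimodal video understanding

Processes video and text input together, interprets the video content, and generates relevant text output.

Efficient video processing

Handles videos with up to 460800 pixels (`max_pixels`) and 32 frames (`nframes`), balancing compute cost against video understanding quality.

Structured output

Can place its final answer inside <answer> tags, which simplifies downstream processing and analysis.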
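To make the 460800-pixel and 32-frame limits more concrete, the following is a rough sketch of how they might translate into a visual-token budget. It rests on assumptions that this page does not state: that one visual token covers a 28x28-pixel area, that frames are merged two at a time, and that `max_pixels` applies per frame. The function name `estimate_visual_tokens` is hypothetical and used only for illustration.

```python
# Assumptions (not stated in this model card): one visual token covers a
# 28x28-pixel area, frames are merged in pairs, and max_pixels is per frame.
PIXELS_PER_TOKEN = 28 * 28   # assumed spatial footprint of one visual token
TEMPORAL_MERGE = 2           # assumed number of frames merged per token


def estimate_visual_tokens(max_pixels: int, nframes: int) -> int:
    """Rough upper bound on the number of visual tokens for one video."""
    tokens_per_frame_group = max_pixels // PIXELS_PER_TOKEN
    frame_groups = -(-nframes // TEMPORAL_MERGE)  # ceiling division
    return tokens_per_frame_group * frame_groups


budget = estimate_visual_tokens(max_pixels=460800, nframes=32)
print(budget)
```

Under these assumptions, lowering either `max_pixels` or `nframes` shrinks the budget proportionally, which is the trade-off between compute cost and detail that the feature description refers to.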

Model Capabilities

Video content understanding

Video question answering

Multimodal reasoning

Structured text generation

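Use Cases

Video content analysis

Video question answering system

A user uploads a video and asks a question; the model analyzes the video content and answers.

Understands the video content and returns a relevant answer.

Video content summarization

Automatically generates a text summary of the video content.

Produces a concise description of the video content.

Intelligent interaction

Educational assistance

After watching an instructional video, students deepen their understanding through question answering.

Provides explanations of and answers about the teaching content.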

🚀 VideoChat-R1_7B

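VideoChat-R1_7B is a multimodal video-text-to-text model. It is built on the Qwen/Qwen2.5-VL-7B-Instruct base model and can be used for a range of video question answering tasks.

🚀 Quick Start

Installation

Install the required packages: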

pip install transformers
pip install qwen_vl_utils
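The usage example in this README requests `attn_implementation="flash_attention_2"`, which depends on the separately installed `flash-attn` package; the two commands above do not install it. A minimal sketch of one way to fall back when it is missing is shown here. The helper name `pick_attn_implementation` is hypothetical, and `"sdpa"` (PyTorch scaled dot-product attention) is assumed to be an acceptable substitute for this model.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as the `attn_implementation` argument of `from_pretrained` in place of the hard-coded value.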

Usage Example

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B"
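# Default: load the model on the available device(s).
# flash_attention_2 requires the separate flash-attn package; remove the
# attn_implementation argument to use the library's default attention.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)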

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 460800,
                "nframes": 32
            },
            {"type": "text", "text": f"""{question}
            Provide your final answer within the <answer> </answer> tags.
             """},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also passed to the model so that
# frames can be aligned with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
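Because the prompt asks for the final answer inside `<answer> </answer>` tags, the decoded string can be post-processed to pull that answer out. The following is a minimal sketch, not part of the original model card; the helper `extract_answer` and the sample string are hypothetical, and it assumes the model actually emits a matching pair of tags.

```python
import re
from typing import Optional

# Non-greedy match so that each <answer>...</answer> pair is captured separately.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(text: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> pair, or None."""
    matches = ANSWER_PATTERN.findall(text)
    return matches[-1].strip() if matches else None


# Hypothetical model output, written by hand for illustration only.
sample_output = "The cups are shuffled twice. <answer> The left cup </answer>"
print(extract_answer(sample_output))
```

With the usage example above, `output_text` is a list with one string per input, so the helper would be applied to `output_text[0]`. Returning `None` when no tags are found lets the caller decide whether to fall back to the raw text.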

📄 License

This project is released under the Apache-2.0 license.

✏️ Citation

If you use this model, please cite the following paper:

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}