VideoChat-R1_7B_caption: An Open-Source Multimodal Model for Video Understanding and Caption Generation

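Developed by OpenGVLab

VideoChat-R1_7B_caption is a multimodal video-text-to-text model built on Qwen2-VL-7B-Instruct. It focuses on understanding video content and generating descriptions of it.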

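Task: video-text-to-text

Library: Transformers

Language: English · License: Apache-2.0 · Tags: #video-content-understanding #multimodal-question-answering #high-precision-caption-generation

Downloads: 48

Released: 4/22/2025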

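Model Overview

The model takes a video as input and generates a detailed text description, which makes it suitable for video content analysis and understanding tasks.

Model Features

Multimodal understanding

Processes video and text inputs together, interprets the video content, and generates a relevant description.

Detailed description generation

Generates detailed descriptions of video content, covering scenes, actions, and events.

Visible reasoning process

Before producing its final answer, the model outputs its reasoning inside <think> tags, which improves interpretability.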
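Because the reasoning and the final answer arrive in one generated string, downstream code usually needs to separate them. The following is a minimal sketch, assuming the output follows the `<think> </think>` / `<answer> </answer>` layout requested by the prompt in the usage example. The helper name `split_think_answer`, the `ParsedOutput` type, the fallback behavior, and the sample string are all illustrative assumptions, not part of the model's API or real model output.

```python
import re
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    thinking: Optional[str]  # text inside <think>...</think>, or None if absent
    answer: str              # text inside <answer>...</answer>, or a fallback


def split_think_answer(text: str) -> ParsedOutput:
    """Separate the reasoning trace from the final answer in one generated string."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    thinking = think.group(1).strip() if think else None
    if answer:
        final = answer.group(1).strip()
    else:
        # Fallback (an assumption): drop any <think> block and keep the rest.
        final = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return ParsedOutput(thinking, final)


# Hand-written sample string for illustration only.
sample = "<think>A dog runs across a lawn.</think> <answer>A dog plays fetch in a park.</answer>"
parsed = split_think_answer(sample)
print(parsed.answer)
```

In practice the function would be applied to each string in the decoded output list, since the model may occasionally omit one of the tags.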

Model Capabilities

Video content understanding

Text description generation

Multimodal processing

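Use Cases

Video analysis

Video content description

Generates a detailed text description of a video.

Accurately describes the scenes, people, and actions in the video.

Assistive tools

Video summarization

Generates a concise summary of a long video.

Extracts the key information from the video and produces a short summary.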
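The two use cases above differ mainly in the question put to the model. As a rough sketch, the chat message from the usage example can be wrapped in a small helper so that only the question and sampling settings change. The helper name `build_messages`, the summary question wording, and the lower `fps` value are assumptions for illustration; the model card itself only shows the "Describe the video in detail." prompt.

```python
from typing import Any, Dict, List

# Instruction suffix taken from the prompt in the usage example.
REASONING_SUFFIX = (
    "First output the thinking process in <think> </think> tags "
    "and then output the final answer in <answer> </answer> tags"
)


def build_messages(video_path: str, question: str,
                   max_pixels: int = 360 * 420, fps: float = 1.0) -> List[Dict[str, Any]]:
    """Build a single-turn chat message containing one video and one question."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path, "max_pixels": max_pixels, "fps": fps},
            {"type": "text", "text": f"{question} {REASONING_SUFFIX}"},
        ],
    }]


# Prompt from the model card's usage example.
caption_messages = build_messages("your_video.mp4", "Describe the video in detail.")

# Hypothetical summary prompt; sampling fewer frames per second is one way
# to keep long videos within a manageable number of frames.
summary_messages = build_messages(
    "your_video.mp4", "Summarize the video in two or three sentences.", fps=0.5
)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the `messages` variable in the usage example.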

🚀 VideoChat-R1_7B_caption

VideoChat-R1_7B_caption is a multimodal video-text-to-text model built on the Qwen/Qwen2-VL-7B-Instruct base model. It can be used to describe video content in detail.

🚀 Quick Start

Install the required packages:

pip install transformers
pip install qwen_vl_utils
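The usage example requests `attn_implementation="flash_attention_2"`, which depends on the separate `flash-attn` package that the commands above do not install. One possible way to handle this, sketched here as an assumption rather than a recommendation from the model authors, is to check whether `flash_attn` is importable and otherwise fall back to PyTorch's `"sdpa"` attention. The helper name `pick_attn_implementation` is hypothetical.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result can be passed as the `attn_implementation` argument of `from_pretrained`. Note that this check only confirms the package is installed; it does not verify that the GPU supports FlashAttention.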

Then load the model and run inference:

💻 Usage Example

Basic usage

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B_caption"
# Default: load the model onto the available device(s).
# attn_implementation="flash_attention_2" requires the flash-attn package.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Describe the video in detail."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"{question} First output the thinking process in <think> </think> tags and then output the final answer in <answer> </answer> tags"},
        ],
    }
]

# In Qwen2-VL, frame-rate information is also passed to the model to align frames with absolute time.
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)  # move inputs to the device that holds the model

# Inference: generate, then strip the prompt tokens from each output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
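The trimming step in the example exists because `generate` on a decoder-only model returns the prompt tokens followed by the newly generated tokens. The toy sketch below reproduces that slice with plain Python lists so the effect is easy to see; the token ids are made up for illustration and do not come from the model's tokenizer.

```python
# Made-up token ids: each output row starts with a copy of its input row.
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

# Same slicing pattern as in the usage example: drop len(prompt) tokens per row.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[42, 43], [44, 45]]
```

Without this step, the decoded text would begin with the chat template and the user's question before the model's `<think>` and `<answer>` content.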

📄 License

This project is released under the Apache-2.0 license.

✏️ Citation

If you use this model, please cite the following paper:

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}