VideoChat-R1-thinking_7B: an open-source multimodal model for video-text-to-text tasks

Developed by OpenGVLab

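VideoChat-R1-thinking_7B is a multimodal model based on Qwen2.5-VL-7B-Instruct. It is specialized for video-text-to-text tasks: it takes a video and a text prompt as input and produces text as output.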

Task: Video-Text-to-Text

Library: Transformers

Language: English

Open Source License: Apache-2.0

Tags: #video-text understanding #multimodal interaction #7B parameters

Downloads 800

Release Time : 4/13/2025

Model Overview

This model combines vision and language processing to understand video content and generate text descriptions of it.

Model Features

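Multimodal processing

Processes video and text inputs together for cross-modal understanding and generation.

High accuracy

Reported to achieve high accuracy on video-text-to-text tasks.

Instruction following

Supports instruction-style interaction, generating text in response to user instructions.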

Model Capabilities

Video content understanding

Text generation

Multimodal reasoning

Use Cases

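Video content analysis

Video summarization

Generates a concise text summary based on the video content.

Produces accurate, consistent video summaries.

Video question answering

Answers specific questions about the video content.

Provides accurate answers grounded in the video content.

Education

Educational video support

Generates supplementary text and captions for educational videos.

Improves the accessibility and comprehensibility of educational videos.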
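
The use cases above differ mainly in the text prompt that accompanies the video. The following is a minimal sketch, assuming the Qwen2.5-VL-style chat message layout (a user turn holding one video entry and one text entry). The `build_messages` helper, the `PROMPTS` table, and the prompt wordings are hypothetical illustrations, not prompts published by the model authors.

```python
from typing import Any, Dict, List

# Hypothetical prompt templates, one per use case (illustrative wording only).
PROMPTS = {
    "summary": "Summarize the video in a few sentences.",
    "qa": "{question}",
    "captions": "Write short captions describing what happens in the video, in order.",
}


def build_messages(
    video_path: str,
    task: str,
    question: str = "",
    fps: float = 1.0,
    max_pixels: int = 360 * 420,
) -> List[Dict[str, Any]]:
    """Build a single-turn chat message list for one video and one task."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    if task == "qa" and not question:
        raise ValueError("the 'qa' task needs a non-empty question")
    text = PROMPTS[task].format(question=question)
    return [
        {
            "role": "user",
            "content": [
                # Video entry: path plus sampling options.
                {"type": "video", "video": video_path, "max_pixels": max_pixels, "fps": fps},
                # Text entry: the task-specific instruction.
                {"type": "text", "text": text},
            ],
        }
    ]


messages = build_messages("your_video.mp4", "qa", question="Where is the final cup containing the object?")
print(messages[0]["content"][1]["text"])
```

A list built this way can be passed to a processor's chat template in place of a hand-written `messages` literal.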

🚀 VideoChat-R1-thinking_7B

VideoChat-R1-thinking_7B is a multimodal model that works with video and text. It is built on Qwen/Qwen2.5-VL-7B-Instruct and supports video understanding and response generation through video-text interaction.

[📂 GitHub]
[📜 Tech Report]

🚀 Quick Start

📦 Installation

A minimal installation example:

pip install transformers
pip install qwen_vl_utils
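
The usage example requests `attn_implementation="flash_attention_2"`, which depends on the separate `flash_attn` package that the commands above do not install. As a hedged sketch, the check below picks FlashAttention 2 when that package is importable and otherwise falls back to `"sdpa"`, another attention backend name accepted by Transformers. The helper name `pick_attn_implementation` is hypothetical.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as the `attn_implementation` argument of `from_pretrained`.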

💻 Usage Example

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1-thinking_7B"
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"""{question}
             
             Output your thought process within the <think> </think> tags, including analysis with either specific timestamps (xx.xx) or time ranges (xx.xx to xx.xx) in <timestep> </timestep> tags.

            Then, provide your final answer within the <answer> </answer> tags.
             """},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also passed to the model so it can align frames with absolute time.
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move inputs to the device the model was loaded on (works with device_map="auto")
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
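
The prompt above asks the model to wrap its reasoning in `<think>` tags, timestamps in `<timestep>` tags, and the final answer in `<answer>` tags. The following is a minimal sketch of pulling those parts out of one decoded string with regular expressions. It assumes the model actually follows the requested format; the `parse_response` helper and the sample string are hypothetical, and missing tags yield `None` or an empty list.

```python
import re
from typing import List, Optional, Tuple, TypedDict


class ParsedResponse(TypedDict):
    thought: Optional[str]
    answer: Optional[str]
    time_spans: List[Tuple[float, float]]


_NUM = r"\d+(?:\.\d+)?"


def _tag(text: str, name: str) -> Optional[str]:
    """Return the stripped body of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> ParsedResponse:
    """Split a response into thought, answer, and (start, end) time spans."""
    spans: List[Tuple[float, float]] = []
    for body in re.findall(r"<timestep>(.*?)</timestep>", text, flags=re.DOTALL):
        # Accept either a single timestamp "xx.xx" or a range "xx.xx to xx.xx".
        match = re.search(rf"({_NUM})(?:\s*to\s*({_NUM}))?", body)
        if match:
            start = float(match.group(1))
            end = float(match.group(2)) if match.group(2) else start
            spans.append((start, end))
    return {
        "thought": _tag(text, "think"),
        "answer": _tag(text, "answer"),
        "time_spans": spans,
    }


# Hypothetical sample in the requested format (not real model output).
sample = (
    "<think>The cups are shuffled <timestep>3.50 to 7.25</timestep> "
    "and stop at <timestep>9.00</timestep>.</think>\n"
    "<answer>The left cup</answer>"
)
parsed = parse_response(sample)
print(parsed["answer"], parsed["time_spans"])
```

Because `output_text` is a list with one string per batch item, a call such as `parse_response(output_text[0])` would handle the single-video case shown above.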

📄 License

This project is released under the Apache-2.0 license.

✏️ Citation

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}