Video-R1-7B: an open-source multimodal large language model that understands video content and answers questions about it

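Developed by Video-R1

Video-R1-7B is a multimodal large language model optimized from Qwen2.5-VL-7B-Instruct. It specializes in video reasoning tasks: it can understand video content and answer related questions.

Video-text-to-text

Transformers

English · License: Apache-2.0 · #video reasoning enhancement #multimodal large language model #open-ended question answering

Downloads: 2,129

Released: March 27, 2025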

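Model overview

With its strengthened video reasoning ability, the model processes video input and generates text answers. It supports a range of question types, including multiple-choice and open-ended questions.

Model features

Video reasoning

Understands video content, reasons about it in depth, and answers complex questions related to the video.

Multimodal processing

Accepts video and text as joint input and fuses information from both modalities.

Natural-language reasoning

Expresses its thinking process in natural language while reasoning, which makes its answers easier to interpret.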

Model capabilities

Video content understanding

Multimodal reasoning

Text generation

Question answering

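Use cases

Education

Educational video Q&A

Students can upload educational videos and ask questions; the model analyzes the video content and answers them.

Improves learning efficiency and deepens understanding of the video content.

Industry

Industrial video analysis

Analyzes operating procedures shown in industrial videos and answers questions about operation steps or the causes of problems.

Helps engineers pinpoint problems quickly and improve production efficiency.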

🚀 Video-R1-7B model

This repository contains the Video-R1-7B model presented in the paper Video-R1: Reinforcing Video Reasoning in MLLMs. The model targets the video-text-to-text task and is built on the transformers library.

🚀 Quick start

For training and evaluation of this model, see the project's code. To run inference on a single example, see the project's sample code.

💻 Usage example

Basic usage

import os
import torch
from vllm import LLM, SamplingParams
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info

# Set model path
model_path = "Video-R1/Video-R1-7B"

# Set video path and question
video_path = "./src/example_video/video1.mp4"
question = "Which move motion in the video lose the system energy?"

# Choose the question type from 'multiple choice', 'numerical', 'OCR', 'free-form', 'regression'
problem_type = 'free-form'

# Initialize the LLM
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    max_model_len=81920,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"video": 1, "image": 1},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    max_tokens=1024,
)

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "left"
processor.tokenizer = tokenizer

# Prompt template
QUESTION_TEMPLATE = (
    "{Question}\n"
    "Please think about this question as if you were a human pondering deeply. "
    "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your detailed reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags."
)

# Question type 
TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
}

# Construct multimodal message
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 200704, # max pixels for each frame
                "nframes": 32 # number of frames to sample from the video
            },
            {
                "type": "text",
                "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[problem_type]
            },
        ],
    }
]

# Convert to prompt string
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process video input
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

# Prepare vllm input
llm_inputs = [{
    "prompt": prompt,
    "multi_modal_data": {"video": video_inputs[0]},
    "mm_processor_kwargs": {key: val[0] for key, val in video_kwargs.items()},
}]

# Run inference
outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text

print(output_text)
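
The prompt template above asks the model to put its reasoning between <think> and </think> and its final answer between <answer> and </answer>. A minimal sketch of how that output could be split apart is shown below. The helper name extract_tag and the sample_output string are hypothetical and used only for illustration; real model output may omit or malform a tag, in which case the helper returns None.

```python
import re


def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    # DOTALL lets the reasoning span multiple lines; the non-greedy match
    # stops at the first closing tag.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical output string, written by hand for this example (not real model output).
sample_output = (
    "<think>Let me think. The ball slows down after the bounce, "
    "so energy is lost there.</think>\n"
    "<answer>The bounce against the floor.</answer>"
)

reasoning = extract_tag(sample_output, "think")
answer = extract_tag(sample_output, "answer")
print(answer)
```

In the basic usage script, the same helper could be applied to output_text in place of sample_output.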

📄 License

This project is released under the Apache-2.0 license.

📚 Documentation

Property	Details
Pipeline tag	video-text-to-text
Library name	transformers
Datasets	Video-R1/Video-R1-data
Language	en
Metrics	accuracy
Base model	Qwen/Qwen2.5-VL-7B-Instruct
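
The table lists accuracy as the evaluation metric. As a rough illustration only, and not the project's evaluation script, the sketch below computes accuracy for multiple-choice answers, assuming a prediction counts as correct when its option letter matches the reference exactly (ignoring case and surrounding whitespace). The function name accuracy and the preds and refs lists are hypothetical.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match their reference (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        # A missing prediction (None) is treated as an empty, incorrect answer.
        (pred or "").strip().upper() == ref.strip().upper()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical predictions and references for four multiple-choice questions.
preds = ["A", "c", None, "B"]
refs = ["A", "C", "D", "D"]
print(accuracy(preds, refs))  # 0.5
```

Other question types, such as numerical or free-form answers, would need a different matching rule than exact letter comparison.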