Video-R1-7B Open-Source Multimodal Large Model - Free Video Understanding and Accurate Question Answering

Video R1 7B

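Developed by Video-R1

Video-R1-7B is a multimodal large language model fine-tuned from Qwen2.5-VL-7B-Instruct. It focuses on video reasoning: it interprets video content and answers questions about it.

Video-to-Text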

Transformers

Language: English  License: Apache-2.0  #Video reasoning enhancement #Multimodal large language model #Open-ended question answering

Downloads: 2,129

Released: 3/27/2025

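Model Overview

The model is trained for stronger video reasoning. It takes video input and generates text answers, and it supports several question types, including multiple-choice and open-ended questions.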

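Model Features

Video reasoning

Understands video content and reasons over it in depth to answer complex questions about the video.

Multimodal processing

Accepts video and text as joint input and fuses information from both modalities.

Natural-language reasoning

Expresses its thinking process in natural language while reasoning, which makes its answers easier to interpret.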
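
Because the prompt in the usage example asks for the reasoning inside `<think>` tags and the final answer inside `<answer>` tags, the two parts can be separated after generation. The following is a minimal sketch under that assumption; the helper name `split_reasoning` is hypothetical and is not part of the Video-R1 code.

```python
import re

# Tag names follow the prompt template in the usage example.
# The helper itself is an illustrative assumption, not Video-R1 code.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_reasoning(output_text):
    """Return (reasoning, answer); either is None when its tags are missing."""
    think = _THINK.search(output_text)
    answer = _ANSWER.search(output_text)
    reasoning = think.group(1).strip() if think else None
    final = answer.group(1).strip() if answer else None
    return reasoning, final


# Made-up sample string, used only to exercise the helper.
sample = "<think>Hmm, let me think. The ball slows down.</think>\n<answer>Friction</answer>"
reasoning, final = split_reasoning(sample)
print(final)
```

Returning `None` for a missing tag pair, rather than raising, lets the caller decide how to handle generations that were cut off by the token limit.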

Model Capabilities

Video content understanding

Multimodal reasoning

Text generation

Question answering

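Use Cases

Education

Question answering on instructional videos

Students upload an instructional video and ask questions; the model analyzes the video and answers them.

Improves learning efficiency and deepens understanding of the video content.

Industry

Industrial video analysis

Analyzes operating procedures shown in industrial videos and answers questions about the steps involved or the cause of a problem.

Helps engineers locate problems quickly and improves production efficiency.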

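🚀 Video-R1-7B Model

This repository contains the Video-R1-7B model introduced in Video-R1: Reinforcing Video Reasoning in MLLMs. The model is intended for video reasoning tasks and supports the use of multimodal large language models on video.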

🚀 Quick Start

For training and evaluation, see the code: https://github.com/tulerfeng/Video-R1

For single-example inference, see: https://github.com/tulerfeng/Video-R1/blob/main/src/inference_example.py

💻 Usage Example

Basic Usage

from vllm import LLM, SamplingParams
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info

# Set model path
model_path = "Video-R1/Video-R1-7B"

# Set video path and question
video_path = "./src/example_video/video1.mp4"
question = "Which move motion in the video lose the system energy?"

# Choose the question type from 'multiple choice', 'numerical', 'OCR', 'free-form', 'regression'
problem_type = 'free-form'

# Initialize the LLM
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    max_model_len=81920,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"video": 1, "image": 1},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    max_tokens=1024,
)

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "left"
processor.tokenizer = tokenizer

# Prompt template
QUESTION_TEMPLATE = (
    "{Question}\n"
    "Please think about this question as if you were a human pondering deeply. "
    "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your detailed reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags."
)

# Answer-format instruction appended for each question type
TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
}

# Construct multimodal message
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 200704, # upper bound on pixels per frame after resizing
                "nframes": 32 # number of frames sampled from the video
            },
            {
                "type": "text",
                "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[problem_type]
            },
        ],
    }
]

# Convert to prompt string
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process video input
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

# Prepare vllm input
llm_inputs = [{
    "prompt": prompt,
    "multi_modal_data": {"video": video_inputs[0]},
    "mm_processor_kwargs": {key: val[0] for key, val in video_kwargs.items()},
}]

# Run inference
outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text

print(output_text)
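
The `nframes` and `max_pixels` settings above largely determine how much of `max_model_len` the video consumes. The sketch below gives a rough upper bound, assuming the Qwen2.5-VL convention of one visual token per 28×28-pixel area and two frames merged per temporal patch. Both constants and the helper name `estimate_video_tokens` are assumptions for illustration; the actual count depends on how `qwen_vl_utils` resizes the frames.

```python
def estimate_video_tokens(nframes, max_pixels, pixels_per_token=28 * 28, frames_per_patch=2):
    """Rough upper bound on visual tokens for one video.

    Assumes one token per 28x28-pixel area and two frames per temporal
    patch (assumed Qwen2.5-VL defaults, not taken from the Video-R1 code).
    """
    temporal_steps = -(-nframes // frames_per_patch)  # ceiling division
    tokens_per_step = max_pixels // pixels_per_token
    return temporal_steps * tokens_per_step


# Settings from the usage example: 32 frames, 200704 pixels per frame.
budget = estimate_video_tokens(nframes=32, max_pixels=200704)
print(budget)  # 4096
```

Under these assumptions the example video contributes at most about 4,096 visual tokens, which leaves ample room within `max_model_len=81920` for the text prompt and the 1,024 generated tokens.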