Qwen2.5-vl-7b-cam-motion-preview open-source model: camera motion classification and video-text retrieval

Qwen2.5 Vl 7b Cam Motion Preview

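Developed by chancharikm

A camera motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, focused on camera motion classification in video and on video-text retrieval.

Video-to-Text

Transformers

License: Other #Camera motion classification #Video-text retrieval #Multimodal understanding

Downloads: 1,456

Release date: 4/28/2025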

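Model Overview

This multimodal model is optimized for camera motion analysis. It identifies the type of camera motion in a video and scores how well the video matches a text description.

Model Features

Camera motion recognition

Identifies camera motions in video such as dolly in/out, panning, and tilting.

Video-text match scoring

Computes a score for how well a video matches a text description, which can be used for retrieval.

Multimodal understanding

Processes video and text inputs together for cross-modal understanding.

Benchmark performance

Achieves state-of-the-art results on CameraBench for camera motion classification and retrieval.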

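Model Capabilities

Video content analysis

Camera motion classification

Video-text match scoring

Multimodal reasoning

Natural language generation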

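Use Cases

Video analysis

Camera motion classification

Automatically identify the type of camera motion in a video clip.

Classifies common camera motions such as dolly in/out, panning, and tilting.

Video retrieval

Find video clips that match a text description.

Provides a match score between a video and a text description.

Film and television production

Shot analysis

Analyze how shots are used in film and television works.

Helps in understanding a director's visual language.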

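🚀 Video-Text Processing Model

This model is fine-tuned from a pretrained model and reaches state-of-the-art results on camera motion classification and video-text retrieval. A typical use is deciding whether a video matches a given camera motion description.

🚀 Quick Start

This model is used in the same way as Qwen2.5-VL. Its main purposes are camera motion classification in video and video-text retrieval, and it is currently state of the art on both tasks.

✨ Key Features

Fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct on a publicly available, high-quality camera motion dataset.
Reaches state-of-the-art results on camera motion classification and video-text retrieval.

📦 Installation

The model card does not list installation steps; follow the installation instructions for Qwen2.5-VL. The examples below import transformers, qwen_vl_utils, and torch.

💻 Usage Examples

Basic usage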

# Import necessary libraries
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare input data
video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f"Does this video show \"{text_description}\"?"

# Format the input for the model
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,  # Recommended FPS for optimal inference
            },
            {"type": "text", "text": question},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Generate with score output
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # Use greedy decoding to get reliable logprobs
        output_scores=True,
        return_dict_in_generate=True
    )

# Calculate probability of "Yes" response
scores = outputs.scores[0]
probs = torch.nn.functional.softmax(scores, dim=-1)
yes_token_id = processor.tokenizer.encode("Yes")[0]
score = probs[0, yes_token_id].item()

print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
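
The same yes/no scoring can be repeated over several candidate descriptions to label one clip. The sketch below assumes a hypothetical `score_fn(video_path, question)` that wraps the scoring code above and returns the "Yes" probability. The helper names, the candidate descriptions, and the 0.5 threshold are illustrative assumptions and are not part of the model card.

```python
from typing import Callable, Dict, List


def build_question(text_description: str) -> str:
    # Same prompt template as in the basic usage example.
    return f'Does this video show "{text_description}"?'


def classify_motions(
    video_path: str,
    descriptions: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cutoff; tune it on held-out data
) -> Dict[str, float]:
    """Keep the descriptions whose "Yes" probability reaches the threshold."""
    scores = {d: score_fn(video_path, build_question(d)) for d in descriptions}
    return {d: s for d, s in scores.items() if s >= threshold}


# Stand-in scorer so the sketch runs without a GPU. Replace it with a
# function that runs the model as in the basic usage example.
def dummy_score_fn(video_path: str, question: str) -> float:
    return 0.9 if "tilting upward" in question else 0.1


labels = classify_motions(
    "file:///path/to/video1.mp4",
    ["the camera tilting upward", "the camera panning left"],
    dummy_score_fn,
)
print(labels)
```

Passing the scorer in as an argument keeps the labeling logic separate from model loading, so the threshold and the candidate list can be changed without touching the inference code.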

Advanced usage

# The model is trained on 8.0 FPS which we recommend for optimal inference

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch  # needed for torch.bfloat16 in the optional flash_attention_2 variant below

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "chancharikm/qwen2.5-vl-7b-cam-motion-preview",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

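📚 Documentation

Training and evaluation data

The training and evaluation data are available in our repository.

Training procedure

We fine-tuned the model with the LLaMA-Factory codebase. To reproduce our work, use the data above and the hyperparameters below.

Training hyperparameters

The following hyperparameters were used during training:

Learning rate: 1e-05
Training batch size: 4
Evaluation batch size: 1
Random seed: 42
Distributed type: multi-GPU
Number of devices: 8
Gradient accumulation steps: 8
Total training batch size: 256
Total evaluation batch size: 8
Optimizer: adamw_torch with β1=0.9, β2=0.999, ε=1e-08, and no additional optimizer arguments
Learning rate scheduler type: cosine
Learning rate scheduler warmup ratio: 0.1
Number of epochs: 10.0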
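
The totals in the list follow from the per-device settings: 4 × 8 devices × 8 accumulation steps = 256 for training, and 1 × 8 devices = 8 for evaluation. The sketch below checks that arithmetic and illustrates a cosine schedule with linear warmup at ratio 0.1. The function names and the step count of 1,000 are illustrative assumptions. The real number of optimizer steps depends on the dataset size, and the schedule used by LLaMA-Factory may differ in details such as how warmup steps are rounded.

```python
import math


def effective_batch_size(per_device: int, num_devices: int, grad_accum: int = 1) -> int:
    """Samples consumed per optimizer step across all devices."""
    return per_device * num_devices * grad_accum


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(4, 8, 8))  # training: 256
print(effective_batch_size(1, 8))     # evaluation: 8

total_steps = 1000  # assumed for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
```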

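🔧 Technical Details

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, trained on the highest-quality camera motion dataset that is currently publicly available. This preview model reaches state-of-the-art results on camera motion classification and on video-text retrieval with camera-motion captions, where retrieval is scored with VQAScore. More information about our work is available on our CameraBench GitHub page.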
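
As a rough illustration of VQAScore-style retrieval: each video is scored by the probability the model assigns to "Yes" as the first answer token, and the videos are then ranked by that score. The sketch below uses plain Python with made-up logits. `yes_probability`, `rank_videos`, and the toy values are hypothetical; in practice the logits would come from `outputs.scores[0]` as in the basic usage example.

```python
import math
from typing import Dict, List, Tuple


def yes_probability(logits: List[float], yes_index: int) -> float:
    """Softmax over first-token logits, returning the entry for "Yes"."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[yes_index] / sum(exps)


def rank_videos(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort videos from best to worst match for one text description."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up two-token vocabulary: index 0 stands for "Yes", index 1 for "No".
toy_logits = {
    "video_a.mp4": [2.0, 0.0],
    "video_b.mp4": [0.0, 1.0],
}
ranking = rank_videos(
    {video: yes_probability(logits, 0) for video, logits in toy_logits.items()}
)
print(ranking)
```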

📄 License

License type: Other

✏️ Citation

If you find this repository useful for your research, please cite:

@article{lin2025camerabench,
  title={Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025},
}