Qwen2.5-VL-7B-Cam-Motion-Preview: Open-Source Model for Video Camera-Motion Classification and Video-Text Retrieval

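Developed by chancharikm

A camera-motion analysis model fine-tuned from Qwen2.5-VL-7B-Instruct, specialized for classifying camera motion in videos and for video-text retrieval.

Video-Text-to-Text

Transformers

Open-source license: other #camera-motion-classification #video-text-retrieval #multimodal-understanding

Downloads: 1,456

Release date: 4/28/2025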

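Model Overview

This is a multimodal model optimized for camera-motion analysis. It identifies the type of camera motion in a video and scores how well a video matches a text description.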

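Model Features

Camera-motion identification

Identifies camera motions in a video, such as zoom, pan, and tilt.

Video-text match scoring

Computes a score for how well the video content matches a text description; the score can be used for retrieval tasks.

Multimodal understanding

Processes video and text inputs together to support cross-modal understanding.

Benchmark performance

Reported as the current state of the art (SOTA) on CameraBench for camera-motion classification and retrieval.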

Model Capabilities

Video content analysis

Camera-motion classification

Video-text match scoring

Multimodal reasoning

Natural language generation

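Use Cases

Video analysis

Camera-motion classification

Automatically identify the type of camera motion in a video clip.

Classify common camera motions such as zoom, pan, and tilt.

Video retrieval

Find video clips that match a text description.

Provide a match score between a video and a text description.

Film and video production

Shot analysis

Analyze how shots are used in a film or video work.

Help interpret a director's shot language.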

🚀 bal_imb_cap_full_lr2e-4_epoch10.0_freezevisTrue_fps8

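This model is a version of Qwen/Qwen2.5-VL-7B-Instruct fine-tuned on the highest-quality camera-motion dataset that is currently publicly available. This preview model is the current state of the art (SOTA) for camera-motion classification and for video-text retrieval with camera-motion captions, as evaluated with VQAScore. See the CameraBench GitHub page for more information. Updates to the benchmark and the models are planned. Stay tuned!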

🚀 Quick Start

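Intended Uses and Limitations

This model is used in the same way as the Qwen2.5-VL models. It is mainly useful for classifying camera motion in videos and for video-text retrieval (current SOTA on both tasks).

A short demo follows:

Generative scoring (for classification and retrieval):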

# Import necessary libraries
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare input data
video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f"Does this video show \"{text_description}\"?"

# Format the input for the model
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,  # Recommended FPS for optimal inference
            },
            {"type": "text", "text": question},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Generate with score output
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # Use greedy decoding to get reliable logprobs
        output_scores=True,
        return_dict_in_generate=True
    )

# Calculate probability of "Yes" response
scores = outputs.scores[0]
probs = torch.nn.functional.softmax(scores, dim=-1)
yes_token_id = processor.tokenizer.encode("Yes")[0]
score = probs[0, yes_token_id].item()

print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
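
The score printed above is the probability the model assigns to "Yes" for one video and one description. For classification or retrieval, one plausible approach is to compute that score for several candidates and sort them. The sketch below assumes a hypothetical `score_fn(video_path, description)` that wraps the scoring code above; `rank_pairs`, `fake_score`, the clip names, and the numbers are illustrative only and do not come from the model card.

```python
from typing import Callable, List, Tuple


def rank_pairs(
    score_fn: Callable[[str, str], float],
    video_paths: List[str],
    descriptions: List[str],
) -> List[Tuple[str, str, float]]:
    """Score every (video, description) pair and sort from best to worst match."""
    scored = [
        (video, text, score_fn(video, text))
        for video in video_paths
        for text in descriptions
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Stub scorer with made-up numbers, standing in for the real model call.
_fake_scores = {
    ("clip_a.mp4", "the camera tilting upward"): 0.91,
    ("clip_a.mp4", "the camera zooming in"): 0.12,
    ("clip_b.mp4", "the camera tilting upward"): 0.08,
    ("clip_b.mp4", "the camera zooming in"): 0.77,
}


def fake_score(video: str, text: str) -> float:
    return _fake_scores[(video, text)]


# Classification: one video, several candidate motion labels.
labels = ["the camera tilting upward", "the camera zooming in"]
best_label = rank_pairs(fake_score, ["clip_a.mp4"], labels)[0][1]

# Retrieval: one description, several candidate videos.
best_video = rank_pairs(
    fake_score, ["clip_a.mp4", "clip_b.mp4"], ["the camera zooming in"]
)[0][0]

print(best_label, best_video)
```

In real use, `fake_score` would be replaced by a function that runs the model once per pair, so the cost grows with the number of videos times the number of descriptions.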

Natural language generation:

# The model is trained on 8.0 FPS which we recommend for optimal inference

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch  # needed for torch.bfloat16 in the optional flash_attention_2 variant

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "chancharikm/qwen2.5-vl-7b-cam-motion-preview",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # fps is supplied through video_kwargs
)
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
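
Both demos build the same message structure and differ only in the text prompt. As a convenience, that structure could be factored into small helpers. This is a minimal sketch: `build_video_messages` and `yes_no_question` are hypothetical names introduced here, not functions from the model card or from `qwen_vl_utils`.

```python
from typing import Any, Dict, List


def build_video_messages(
    video_path: str, prompt: str, fps: float = 8.0
) -> List[Dict[str, Any]]:
    """Build a single-turn chat payload with one video and one text prompt.

    The default of 8.0 FPS follows the recommendation in the demos.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path, "fps": fps},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def yes_no_question(description: str) -> str:
    """Wrap a camera-motion description in the yes/no question used for scoring."""
    return f'Does this video show "{description}"?'


scoring_messages = build_video_messages(
    "file:///path/to/video1.mp4", yes_no_question("the camera tilting upward")
)
caption_messages = build_video_messages(
    "file:///path/to/video1.mp4", "Describe the camera motion in this video."
)
```

Either list can then be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the demos.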

📦 Installation

No specific installation steps are provided.

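📚 Documentation

Training and Evaluation Data

The training and evaluation data can be found in this repository.

Training Procedure

The model was fine-tuned with the LLaMA-Factory codebase. If needed, the run can be reproduced using the data above and the hyperparameters below.

Training Hyperparameters

The following hyperparameters were used during training: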

learning_rate: 1e-05
train_batch_size: 4
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 8
total_train_batch_size: 256
total_eval_batch_size: 8
optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 10.0
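
The batch-size totals in the list follow from the per-device values, and the scheduler settings describe a warmup followed by cosine decay. The sketch below illustrates both relationships in plain Python. It is an approximation under stated assumptions: `total_steps` is a hypothetical value (the dataset size is not given here), and the exact rounding of warmup steps in the training framework may differ.

```python
import math


def effective_batch_size(per_device: int, num_devices: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer update across all devices."""
    return per_device * num_devices * grad_accum_steps


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-05,
    warmup_ratio: float = 0.1,
) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# train: 4 per device x 8 devices x 8 accumulation steps
total_train = effective_batch_size(4, 8, 8)
# eval: 1 per device x 8 devices, no accumulation
total_eval = effective_batch_size(1, 8, 1)

# Hypothetical schedule length, chosen only to make the shape easy to see.
total_steps = 1000
schedule = [lr_at_step(s, total_steps) for s in (0, 50, 100, 550, 1000)]
print(total_train, total_eval, schedule)
```

With these inputs the totals match the listed `total_train_batch_size` and `total_eval_batch_size`, and the learning rate peaks at the end of the first 10% of steps.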

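🔧 Technical Details

No specific technical details are provided.

📄 License

This model is released under the license type "other".

✏️ Citation

If this repository is useful for your research, please use the following citation.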

@article{lin2025camerabench,
  title={Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025},
}