# 🚀 Qwen2.5-VL-7B Camera Motion Model
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, trained on the highest-quality public camera motion dataset currently available. It achieves the current state of the art (SOTA) in camera motion classification and in video-text retrieval with camera motion captions, both evaluated using VQAScore.
## 🚀 Quick Start
This model is used in the same way as any other Qwen2.5-VL model. It is primarily intended for camera motion classification in videos and for video-text retrieval with camera motion captions, and it currently holds the SOTA position in both tasks.
## ✨ Features
- Fine-tuned on a high-quality public camera motion dataset.
- Current SOTA for camera motion classification and video-text retrieval with camera motion captions.
## 📦 Installation
This model depends on the transformers library, which you can install with:

```bash
pip install transformers
```
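The usage examples below also rely on the qwen-vl-utils helper package for video preprocessing:

```bash
pip install qwen-vl-utils
```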
## 💻 Usage Examples
### Basic Usage

The snippet below computes a VQAScore-style matching score: it asks the model a yes/no question about the video and reads out the probability of the "Yes" token. This is the intended setup for camera motion classification and video-text retrieval.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the fine-tuned model and the base processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f'Does this video show "{text_description}"?'

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,
            },
            {"type": "text", "text": question},
        ],
    }
]

# Build the prompt and preprocess the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Generate a single token and read its probability distribution
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    scores = outputs.scores[0]
    probs = torch.nn.functional.softmax(scores, dim=-1)

    # The probability of the "Yes" token is the matching score for this video-text pair
    yes_token_id = processor.tokenizer.encode("Yes")[0]
    score = probs[0, yes_token_id].item()

print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
```
### Advanced Usage

The snippet below uses the model generatively to caption the camera motion in a video.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and the base processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]

# Build the prompt and preprocess the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps metadata extracted above
)
inputs = inputs.to("cuda")

# Generate a caption and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
## 📚 Documentation
### Training and evaluation data
Training and evaluation data can be found in our repo.
### Training procedure
We use the LLaMA-Factory codebase to fine-tune our model. To replicate our work, use the data above with the hyperparameters below; a sketch of a matching config file follows the list.
#### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 8
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10.0
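For reference, a LLaMA-Factory run with these settings might be configured roughly as below. This is a minimal sketch under stated assumptions, not the authors' actual config: the dataset name, template, output directory, cutoff behavior, and precision flag are placeholders, and exact keys vary across LLaMA-Factory versions.

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (placeholder name; register the camera motion data in LLaMA-Factory's dataset_info.json)
dataset: camera_motion_sft
template: qwen2_vl

### output (placeholder path)
output_dir: saves/qwen2_5_vl-7b-cam-motion

### train (matches the hyperparameters listed above)
per_device_train_batch_size: 4
per_device_eval_batch_size: 1
gradient_accumulation_steps: 8   # 4 per device x 8 GPUs x 8 steps = 256 total
learning_rate: 1.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
optim: adamw_torch
seed: 42
bf16: true   # assumption; precision is not stated in the card
```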
## 🔧 Technical Details
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, trained on the highest-quality public camera motion dataset currently available. Fine-tuning was performed with the LLaMA-Factory codebase using the training hyperparameters listed above.
## 📄 License
The license of this model is listed as "other".
## ✏️ Citation
If you find this repository useful for your research, please cite the following:
```bibtex
@article{lin2025camerabench,
  title={Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025},
}
```