# 🚀 Qwen2.5-VL-7B Camera Motion Model
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, trained on the highest-quality public camera motion dataset currently available. It achieves the current state of the art (SOTA) in camera motion classification and in video-text retrieval with camera motion captions, both evaluated using VQAScore.
## 🚀 Quick Start
This model is used in the same way as any other Qwen2.5-VL model. It is primarily intended for camera motion classification in videos and for video-text retrieval with camera motion captions, and it currently holds the SOTA position in both tasks.
## ✨ Features
- Fine-tuned on a high-quality public camera motion dataset.
- Current SOTA for camera motion classification and video-text retrieval with camera motion captions.
## 📦 Installation
This model depends on the transformers library, which you can install with:

```bash
pip install transformers
```
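The usage examples below also rely on the qwen-vl-utils helper package for video preprocessing:

```bash
pip install qwen-vl-utils
```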
## 💻 Usage Examples
### Basic Usage

The snippet below computes a VQAScore-style matching score: it asks the model a yes/no question about the video and reads out the probability of the "Yes" token. This is the intended setup for camera motion classification and video-text retrieval.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the fine-tuned model and the base processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f'Does this video show "{text_description}"?'

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,
            },
            {"type": "text", "text": question},
        ],
    }
]

# Build the prompt and preprocess the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Generate a single token and read its probability distribution
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    scores = outputs.scores[0]
    probs = torch.nn.functional.softmax(scores, dim=-1)

    # The probability of the "Yes" token is the matching score for this video-text pair
    yes_token_id = processor.tokenizer.encode("Yes")[0]
    score = probs[0, yes_token_id].item()

print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
```
### Advanced Usage

The snippet below uses the model generatively to caption the camera motion in a video.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and the base processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]

# Build the prompt and preprocess the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # carries the fps metadata extracted above
)
inputs = inputs.to("cuda")

# Generate a caption and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
## 📚 Documentation
### Training and evaluation data
Training and evaluation data can be found in our repo.
### Training procedure
We use the LLaMA-Factory codebase to fine-tune our model. To replicate our work, use the data above with the hyperparameters below; a sketch of a matching config file follows the list.
#### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 8
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10.0
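For reference, a LLaMA-Factory run with these settings might be configured roughly as below. This is a minimal sketch under stated assumptions, not the authors' actual config: the dataset name, template, output directory, cutoff behavior, and precision flag are placeholders, and exact keys vary across LLaMA-Factory versions.

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (placeholder name; register the camera motion data in LLaMA-Factory's dataset_info.json)
dataset: camera_motion_sft
template: qwen2_vl

### output (placeholder path)
output_dir: saves/qwen2_5_vl-7b-cam-motion

### train (matches the hyperparameters listed above)
per_device_train_batch_size: 4
per_device_eval_batch_size: 1
gradient_accumulation_steps: 8   # 4 per device x 8 GPUs x 8 steps = 256 total
learning_rate: 1.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
optim: adamw_torch
seed: 42
bf16: true   # assumption; precision is not stated in the card
```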
## 🔧 Technical Details
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, trained on the highest-quality public camera motion dataset currently available. Fine-tuning was performed with the LLaMA-Factory codebase using the training hyperparameters listed above.
## 📄 License
The license of this model is listed as "other".
## ✏️ Citation
If you find this repository useful for your research, please cite the following:
```bibtex
@article{lin2025camerabench,
  title={Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025},
}
```