🚀 VideoLLaMA 3 Vision Encoder
The VideoLLaMA 3 vision encoder is the vision encoder of VideoLLaMA 3, a frontier multimodal foundation model for image and video understanding. It dynamically processes images and videos at varying resolutions, providing richer information for image and video representations.
🚀 Quick Start
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_path = "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"
images = load_image(image_path)

# Load the encoder with bfloat16 weights and flash attention.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess, move tensors to the GPU, and match the model dtype.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

image_features = model(**inputs)
```
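As a quick check of the dynamic-resolution behaviour, the same model and processor can be reused on a second image of a different size. This is a hedged sketch: the second image path is a placeholder for any image you have, the `encode` helper is hypothetical, and it assumes the forward pass returns a tensor of per-token features (the exact return type is defined by the repository's remote code).

```python
# Hedged sketch: reuse the model and processor loaded above on a second image.
# With AVT, the number of vision tokens scales with the input resolution
# instead of being fixed.
def encode(image):
    batch = processor(images=image, merge_size=1)
    batch = {k: torch.tensor(v).cuda() for k, v in batch.items()}
    if "pixel_values" in batch:
        batch["pixel_values"] = batch["pixel_values"].to(torch.bfloat16)
    return model(**batch)

# Placeholder path: substitute any local image or URL of a different resolution.
other_image = load_image("path/to/another_image.jpg")

features_a = encode(images)       # the sora.png example from the quick start
features_b = encode(other_image)
print(features_a.shape, features_b.shape)  # token counts differ with resolution
```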
✨ Key Features
This model serves as the vision encoder in VideoLLaMA 3. VideoLLaMA 3 uses Any-resolution Vision Tokenization (AVT) to dynamically process images and videos of varying resolutions. This is achieved by adapting a pre-trained ViT-based vision encoder to use 2D-RoPE (rotary position embeddings) in place of the absolute position embeddings conventionally used in ViTs. With AVT, VideoLLaMA 3 can represent images and videos in greater detail across different resolutions, giving the vision tokens more information. To ensure seamless integration with AVT, we fine-tune the vision encoder and projector on scene images, document data, and scene images with text during the vision encoder adaptation stage (stage #1 of the VideoLLaMA 3 training pipeline). Before training, the model parameters and architecture are initialized from SigLIP.
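To make the position-embedding change concrete, below is a minimal, illustrative sketch of 2D-RoPE on a patch grid. It is not the model's exact implementation, and the function names (`rope_2d_angles`, `apply_rope_2d`) are hypothetical; the point is that rotation angles are derived from each patch's (row, column) coordinate, so no fixed-size absolute position table has to be interpolated when the input resolution changes.

```python
import torch

def rope_2d_angles(height, width, dim, base=10000.0):
    """Rotation angles for an H x W patch grid, shape (H*W, dim // 2).

    Half of the channel pairs encode the row index and the other half the
    column index, each with the standard RoPE frequency schedule.
    """
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos pairs per axis)"
    freqs = 1.0 / (base ** (torch.arange(dim // 4).float() / (dim // 4)))
    row_angles = torch.outer(torch.arange(height).float(), freqs)  # (H, dim/4)
    col_angles = torch.outer(torch.arange(width).float(), freqs)   # (W, dim/4)
    row_part = row_angles[:, None, :].expand(height, width, -1)
    col_part = col_angles[None, :, :].expand(height, width, -1)
    angles = torch.cat([row_part, col_part], dim=-1)               # (H, W, dim/2)
    return angles.reshape(height * width, dim // 2)

def apply_rope_2d(x, angles):
    """Rotate features x of shape (num_patches, dim) by the per-patch angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The same functions handle any patch-grid size, e.g. 14x14 or 20x9 patches,
# which is what lets the encoder accept inputs at their native resolutions.
q = torch.randn(14 * 14, 64)
q_rotated = apply_rope_2d(q, rope_2d_angles(14, 14, 64))
```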
📚 Documentation
Model Performance
| Base Model | GQA | AI2D | ChartQA | DocVQA (val) | MME |
|---|---|---|---|---|---|
| clip-vit-large-patch14-336 | 61.50 | 56.28 | 18.32 | 24.86 | 1668.41 |
| dfn5B-clip-vit-h-14-378 | 62.70 | 56.87 | 16.40 | 23.09 | 1665.35 |
| siglip-so400m-patch14-384 (our implementation) | 62.92 | 57.12 | 22.44 | 31.32 | 1667.92 |
📄 License
This project is licensed under the Apache 2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
@article{damonlpsg2025videollama3,
title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
journal={arXiv preprint arXiv:2501.13106},
year={2025},
url = {https://arxiv.org/abs/2501.13106}
}
@article{damonlpsg2024videollama2,
title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
journal={arXiv preprint arXiv:2406.07476},
year={2024},
url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
author = {Zhang, Hang and Li, Xin and Bing, Lidong},
journal = {arXiv preprint arXiv:2306.02858},
year = {2023},
url = {https://arxiv.org/abs/2306.02858}
}
If you like our project, please give us a star ⭐ on GitHub to get the latest updates.