🚀 VideoLLaMA 3 Vision Encoder
The VideoLLaMA 3 vision encoder is the visual backbone of VideoLLaMA 3, a frontier multimodal foundation model for image and video understanding. It dynamically processes images and videos at varying resolutions, packing richer information into image and video representations.
🚀 Quick Start
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_path = "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"
images = load_image(image_path)

# Load the vision encoder (custom model code, so trust_remote_code is required).
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image and move the tensors to the GPU.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Extract visual features.
image_features = model(**inputs)
```
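Since the encoder is intended for video understanding, the same pipeline can in principle be run over multiple sampled frames. The snippet below is a hedged sketch that reuses `model`, `processor`, and `load_image` from the Quick Start; it assumes the custom processor accepts a list of images (as standard transformers image processors do) and simply repeats the sample image to stand in for sampled frames. Check the VideoLLaMA3 repository for the officially supported video interface.

```python
# Hedged sketch: run several frames through the same encoder.
# Assumes `model`, `processor`, and `load_image` from the Quick Start above,
# and that the processor accepts a list of images like standard transformers
# image processors do.
frame_urls = [
    "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true",
] * 4  # repeat the sample image to stand in for sampled video frames
frames = [load_image(url) for url in frame_urls]

inputs = processor(images=frames, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
frame_features = model(**inputs)
```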
✨ Key Features
This model serves as the vision encoder in VideoLLaMA3. VideoLLaMA3 adopts Any-resolution Vision Tokenization (AVT) to dynamically process images and videos of varying resolutions. This is achieved by adapting a pre-trained ViT-based vision encoder to use 2D-RoPE (rotary position embedding) in place of the absolute position embeddings conventionally used in ViTs. With AVT, VideoLLaMA3 can represent images and videos at different resolutions in greater detail, packing more information into the vision tokens. To ensure seamless integration with AVT, we fine-tune both the vision encoder and the projector during the vision encoder adaptation stage (Stage #1 of the VideoLLaMA3 training pipeline) using scene images, document data, and scene images with text. Before training, the model parameters and architecture are initialized from SigLIP.
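For readers unfamiliar with 2D-RoPE, the sketch below illustrates the idea in isolation: each patch's row and column indices rotate separate halves of the feature channels, so no fixed-size learned position table is needed and any (height, width) patch grid can be encoded. This is a minimal illustrative example, not the VideoLLaMA3 implementation; the functions `rope_1d`/`rope_2d` and all shapes are assumptions chosen for demonstration.

```python
# Minimal, illustrative sketch of 2D rotary position embedding (2D-RoPE) for
# variable-resolution patch grids. NOT the VideoLLaMA3 implementation; it only
# shows the idea: half of each feature vector is rotated by the patch's row
# index, the other half by its column index, so any (H, W) grid is supported.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Apply 1D RoPE to x of shape (num_tokens, dim) using integer positions pos."""
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]   # (num_tokens, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even / odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """Rotate the first half of the channels by row index, the second half by column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)

# Example: a 3x5 patch grid; any grid size works, unlike learned absolute embeddings.
H, W, dim = 3, 5, 64
rows, cols = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
q = torch.randn(H * W, dim)
q_rot = rope_2d(q, rows.flatten(), cols.flatten())
print(q_rot.shape)  # torch.Size([15, 64])
```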
📚 Documentation
Model Performance
| Base Model | GQA | AI2D | ChartQA | DocVQA (val) | MME |
|---|---|---|---|---|---|
| clip-vit-large-patch14-336 | 61.50 | 56.28 | 18.32 | 24.86 | 1668.41 |
| dfn5B-clip-vit-h-14-378 | 62.70 | 56.87 | 16.40 | 23.09 | 1665.35 |
| siglip-so400m-patch14-384 (our implementation) | 62.92 | 57.12 | 22.44 | 31.32 | 1667.92 |
📄 License
This project is released under the Apache 2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
```bibtex
@article{damonlpsg2025videollama3,
  title   = {VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author  = {Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao},
  journal = {arXiv preprint arXiv:2501.13106},
  year    = {2025},
  url     = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```
If you like our project, please give us a star ⭐ on GitHub to stay up to date with the latest updates.