🚀 VideoLLaMA 3 Vision Encoder
The VideoLLaMA 3 vision encoder is the visual backbone of VideoLLaMA 3, a frontier multimodal foundation model for image and video understanding. It dynamically processes images and videos at varying resolutions, packing richer information into image and video representations.
🚀 Quick Start
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_path = "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"
images = load_image(image_path)

# Load the vision encoder (custom model code, so trust_remote_code is required).
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image and move the tensors to the GPU.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Extract visual features.
image_features = model(**inputs)
```
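Since the encoder is intended for video understanding, the same pipeline can in principle be run over multiple sampled frames. The snippet below is a hedged sketch that reuses `model`, `processor`, and `load_image` from the Quick Start; it assumes the custom processor accepts a list of images (as standard transformers image processors do) and simply repeats the sample image to stand in for sampled frames. Check the VideoLLaMA3 repository for the officially supported video interface.

```python
# Hedged sketch: run several frames through the same encoder.
# Assumes `model`, `processor`, and `load_image` from the Quick Start above,
# and that the processor accepts a list of images like standard transformers
# image processors do.
frame_urls = [
    "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true",
] * 4  # repeat the sample image to stand in for sampled video frames
frames = [load_image(url) for url in frame_urls]

inputs = processor(images=frames, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
frame_features = model(**inputs)
```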
✨ Key Features
This model serves as the vision encoder in VideoLLaMA3. VideoLLaMA3 adopts Any-resolution Vision Tokenization (AVT) to dynamically process images and videos of varying resolutions. This is achieved by adapting a pre-trained ViT-based vision encoder to use 2D-RoPE (rotary position embedding) in place of the absolute position embeddings conventionally used in ViTs. With AVT, VideoLLaMA3 can represent images and videos at different resolutions in greater detail, packing more information into the vision tokens. To ensure seamless integration with AVT, we fine-tune both the vision encoder and the projector during the vision encoder adaptation stage (Stage #1 of the VideoLLaMA3 training pipeline) using scene images, document data, and scene images with text. Before training, the model parameters and architecture are initialized from SigLIP.
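For readers unfamiliar with 2D-RoPE, the sketch below illustrates the idea in isolation: each patch's row and column indices rotate separate halves of the feature channels, so no fixed-size learned position table is needed and any (height, width) patch grid can be encoded. This is a minimal illustrative example, not the VideoLLaMA3 implementation; the functions `rope_1d`/`rope_2d` and all shapes are assumptions chosen for demonstration.

```python
# Minimal, illustrative sketch of 2D rotary position embedding (2D-RoPE) for
# variable-resolution patch grids. NOT the VideoLLaMA3 implementation; it only
# shows the idea: half of each feature vector is rotated by the patch's row
# index, the other half by its column index, so any (H, W) grid is supported.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Apply 1D RoPE to x of shape (num_tokens, dim) using integer positions pos."""
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]   # (num_tokens, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even / odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """Rotate the first half of the channels by row index, the second half by column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)

# Example: a 3x5 patch grid; any grid size works, unlike learned absolute embeddings.
H, W, dim = 3, 5, 64
rows, cols = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
q = torch.randn(H * W, dim)
q_rot = rope_2d(q, rows.flatten(), cols.flatten())
print(q_rot.shape)  # torch.Size([15, 64])
```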
📚 Documentation
Model Performance
| Base Model | GQA | AI2D | ChartQA | DocVQA (val) | MME |
|---|---|---|---|---|---|
| clip-vit-large-patch14-336 | 61.50 | 56.28 | 18.32 | 24.86 | 1668.41 |
| dfn5B-clip-vit-h-14-378 | 62.70 | 56.87 | 16.40 | 23.09 | 1665.35 |
| siglip-so400m-patch14-384 (our implementation) | 62.92 | 57.12 | 22.44 | 31.32 | 1667.92 |
📄 License
This project is released under the Apache 2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
```bibtex
@article{damonlpsg2025videollama3,
  title   = {VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author  = {Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao},
  journal = {arXiv preprint arXiv:2501.13106},
  year    = {2025},
  url     = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```
If you like our project, please give us a star ⭐ on GitHub to stay up to date with the latest updates.