🚀 VideoLLaMA 3 Vision Encoder
The VideoLLaMA 3 vision encoder is the vision encoder of VideoLLaMA 3, a frontier multimodal foundation model for image and video understanding. It dynamically processes images and videos at varying resolutions, providing richer information for image and video representations.
🚀 Quick Start
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_path = "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"
images = load_image(image_path)

# Load the encoder with bfloat16 weights and flash attention.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess, move tensors to the GPU, and match the model dtype.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

image_features = model(**inputs)
```
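As a quick check of the dynamic-resolution behaviour, the same model and processor can be reused on a second image of a different size. This is a hedged sketch: the second image path is a placeholder for any image you have, the `encode` helper is hypothetical, and it assumes the forward pass returns a tensor of per-token features (the exact return type is defined by the repository's remote code).

```python
# Hedged sketch: reuse the model and processor loaded above on a second image.
# With AVT, the number of vision tokens scales with the input resolution
# instead of being fixed.
def encode(image):
    batch = processor(images=image, merge_size=1)
    batch = {k: torch.tensor(v).cuda() for k, v in batch.items()}
    if "pixel_values" in batch:
        batch["pixel_values"] = batch["pixel_values"].to(torch.bfloat16)
    return model(**batch)

# Placeholder path: substitute any local image or URL of a different resolution.
other_image = load_image("path/to/another_image.jpg")

features_a = encode(images)       # the sora.png example from the quick start
features_b = encode(other_image)
print(features_a.shape, features_b.shape)  # token counts differ with resolution
```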
✨ Key Features
This model serves as the vision encoder in VideoLLaMA 3. VideoLLaMA 3 uses Any-resolution Vision Tokenization (AVT) to dynamically process images and videos of varying resolutions. This is achieved by adapting a pre-trained ViT-based vision encoder to use 2D-RoPE (rotary position embeddings) in place of the absolute position embeddings conventionally used in ViTs. With AVT, VideoLLaMA 3 can represent images and videos in greater detail across different resolutions, giving the vision tokens more information. To ensure seamless integration with AVT, we fine-tune the vision encoder and projector on scene images, document data, and scene images with text during the vision encoder adaptation stage (stage #1 of the VideoLLaMA 3 training pipeline). Before training, the model parameters and architecture are initialized from SigLIP.
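To make the position-embedding change concrete, below is a minimal, illustrative sketch of 2D-RoPE on a patch grid. It is not the model's exact implementation, and the function names (`rope_2d_angles`, `apply_rope_2d`) are hypothetical; the point is that rotation angles are derived from each patch's (row, column) coordinate, so no fixed-size absolute position table has to be interpolated when the input resolution changes.

```python
import torch

def rope_2d_angles(height, width, dim, base=10000.0):
    """Rotation angles for an H x W patch grid, shape (H*W, dim // 2).

    Half of the channel pairs encode the row index and the other half the
    column index, each with the standard RoPE frequency schedule.
    """
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos pairs per axis)"
    freqs = 1.0 / (base ** (torch.arange(dim // 4).float() / (dim // 4)))
    row_angles = torch.outer(torch.arange(height).float(), freqs)  # (H, dim/4)
    col_angles = torch.outer(torch.arange(width).float(), freqs)   # (W, dim/4)
    row_part = row_angles[:, None, :].expand(height, width, -1)
    col_part = col_angles[None, :, :].expand(height, width, -1)
    angles = torch.cat([row_part, col_part], dim=-1)               # (H, W, dim/2)
    return angles.reshape(height * width, dim // 2)

def apply_rope_2d(x, angles):
    """Rotate features x of shape (num_patches, dim) by the per-patch angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The same functions handle any patch-grid size, e.g. 14x14 or 20x9 patches,
# which is what lets the encoder accept inputs at their native resolutions.
q = torch.randn(14 * 14, 64)
q_rotated = apply_rope_2d(q, rope_2d_angles(14, 14, 64))
```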
📚 Documentation
Model Performance
| Base Model | GQA | AI2D | ChartQA | DocVQA (val) | MME |
|---|---|---|---|---|---|
| clip-vit-large-patch14-336 | 61.50 | 56.28 | 18.32 | 24.86 | 1668.41 |
| dfn5B-clip-vit-h-14-378 | 62.70 | 56.87 | 16.40 | 23.09 | 1665.35 |
| siglip-so400m-patch14-384 (our implementation) | 62.92 | 57.12 | 22.44 | 31.32 | 1667.92 |
📄 License
This project is licensed under the Apache 2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
@article{damonlpsg2025videollama3,
title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
journal={arXiv preprint arXiv:2501.13106},
year={2025},
url = {https://arxiv.org/abs/2501.13106}
}
@article{damonlpsg2024videollama2,
title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
journal={arXiv preprint arXiv:2406.07476},
year={2024},
url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
author = {Zhang, Hang and Li, Xin and Bing, Lidong},
journal = {arXiv preprint arXiv:2306.02858},
year = {2023},
url = {https://arxiv.org/abs/2306.02858}
}
If you like our project, please give us a star ⭐ on GitHub to get the latest updates.