VL3 - SigLIP - NaViT Open-Source Visual Encoder: Dynamically Process Images and Videos of Different Resolutions

Developed by DAMO-NLP-SG

The visual encoder for VideoLLaMA3, which uses Any-resolution Vision Tokenization (AVT) to dynamically process images and videos of different resolutions.

Image Feature Extraction

Transformers

English

Open Source License: Apache-2.0

#Any-resolution Vision Tokenization #Multimodal Video Understanding #Dynamic Image Processing

Downloads 25.55k

Release Time: 1/21/2025

Model Overview

This model serves as the visual encoder for VideoLLaMA3, employing 2D-RoPE technology to process images and videos of varying resolutions, enriching visual tokens with additional information.

Model Features

Any-resolution Vision Tokenization (AVT)

Dynamically processes images and videos of different resolutions through 2D-RoPE technology

Multimodal Support

Capable of handling image and video data, providing visual features for multimodal large language models

High-Performance Visual Encoding

Scores highest among the three compared base encoders on GQA, AI2D, ChartQA, and DocVQA, with the largest margins on document and chart understanding

Model Capabilities

Image Feature Extraction

Video Feature Extraction

Multimodal Data Processing

High-Resolution Image Processing

Use Cases

Visual Question Answering

Document Understanding

Parsing and comprehending content within document images

Achieved 31.32 accuracy on the DocVQA validation set

Chart Understanding

Analyzing and interpreting information in chart images

Achieved 22.44 accuracy on the ChartQA dataset

Multimodal Large Language Models

VideoLLaMA3 Visual Encoding

Serves as the visual front-end for VideoLLaMA3, processing input images and videos

🚀 Visual Encoder for VideoLLaMA 3

This visual encoder is a key component of VideoLLaMA 3, a frontier multimodal foundation model for image and video understanding.

The visual encoder of VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

🚀 Quick Start

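import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
image_path = "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"
# Download the example image from the VideoLLaMA3 repository.
images = load_image(image_path)

# trust_remote_code=True is needed because the model code ships with the checkpoint.
# attn_implementation="flash_attention_2" requires the flash-attn package and a CUDA GPU.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image, then move every input to the GPU as a tensor.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
# Match the pixel dtype to the bfloat16 model weights.
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
# Run the encoder without tracking gradients to obtain the visual features.
with torch.no_grad():
    image_features = model(**inputs)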
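The Quick Start requests attn_implementation="flash_attention_2", which only works when the separately installed flash-attn package is available. The sketch below shows one way to choose the value at runtime. The helper name pick_attn_implementation is hypothetical, and whether this model's remote code accepts the "sdpa" fallback is an assumption that should be checked before relying on it.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Return `preferred` unless it needs flash-attn and flash-attn is missing.

    Hypothetical helper: the "sdpa" fallback is a standard Transformers value,
    but support for it in this model's remote code is assumed, not confirmed.
    """
    needs_flash = preferred == "flash_attention_2"
    if needs_flash and importlib.util.find_spec("flash_attn") is None:
        return fallback
    return preferred


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as the attn_implementation argument of AutoModel.from_pretrained in place of the hard-coded value.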

✨ Features

This model serves as the visual encoder in VideoLLaMA3.

VideoLLaMA3 uses the Any-resolution Vision Tokenization (AVT) approach to dynamically process images and videos of varying resolutions. To do so, the pre-trained vision encoder (based on the ViT architecture) is adapted to use 2D-RoPE (2D Rotary Position Embeddings) in place of the absolute position embeddings traditionally used in ViT.
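To illustrate why rotary embeddings suit variable-resolution inputs, here is a minimal pure-Python sketch of one common 2D-RoPE formulation. It is not taken from the VideoLLaMA3 code: the function names, the choice to encode the row in the first half of the channels and the column in the second half, and the frequency base of 10000 are all assumptions made for illustration.

```python
import math


def rope_angles(position, num_pairs, base=10000.0):
    """Rotation angle for each channel pair at an integer position."""
    return [position / (base ** (i / num_pairs)) for i in range(num_pairs)]


def rotate_pairs(vec, angles):
    """Rotate consecutive channel pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def apply_rope_2d(vec, row, col, base=10000.0):
    """Assumed layout: first half of the channels encodes the row, second half the column."""
    if len(vec) % 4 != 0:
        raise ValueError("head dimension must be divisible by 4")
    half = len(vec) // 2
    pairs = half // 2
    return (rotate_pairs(vec[:half], rope_angles(row, pairs, base))
            + rotate_pairs(vec[half:], rope_angles(col, pairs, base)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.6, 0.8, -1.2, 0.4, 0.7, -0.2]

# The same relative offset (+1 row, +2 columns) at two different absolute positions.
score_a = dot(apply_rope_2d(q, 3, 5), apply_rope_2d(k, 2, 3))
score_b = dot(apply_rope_2d(q, 10, 7), apply_rope_2d(k, 9, 5))
print(abs(score_a - score_b) < 1e-9)
```

Because the attention score depends only on the relative row and column offset between two patches, no fixed-size table of absolute positions is needed, so the same weights can be applied to patch grids of any height and width.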

With AVT, VideoLLaMA3 is able to represent images and videos with greater detail across different resolutions, enriching the vision tokens with more information. To ensure seamless integration with AVT, we fine-tune both the vision encoder and the projector during the Vision Encoder Adaptation stage (Stage #1 in the VideoLLaMA3 training pipeline) using scene images, document data, and scene images with text.
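As a rough illustration of how the number of vision tokens grows with resolution under AVT, the sketch below estimates the patch grid for an image. The patch size of 14 is inferred from the base model name (siglip-so400m-patch14-384). The function name, the floor-to-grid rule, and the reading of merge_size as a spatial merge factor applied per side are assumptions, not the processor's documented behavior.

```python
def vision_token_count(height, width, patch_size=14, merge_size=1):
    """Estimate (grid_h, grid_w, tokens) for one image.

    Assumptions: each side is floored to a multiple of patch_size * merge_size,
    and every merge_size x merge_size block of patches becomes one token.
    """
    unit = patch_size * merge_size
    grid_h = max(1, height // unit) * merge_size
    grid_w = max(1, width // unit) * merge_size
    tokens = (grid_h * grid_w) // (merge_size ** 2)
    return grid_h, grid_w, tokens


for size in [(384, 384), (720, 1280)]:
    print(size, vision_token_count(*size))
print((720, 1280), vision_token_count(720, 1280, merge_size=2))
```

Under these assumptions a 384x384 image yields a 27x27 grid, while a 720x1280 frame yields a much larger grid, which is how higher-resolution inputs end up represented by more tokens.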

Before training, the model parameters and architecture are initialized from SigLIP.

🚀 Model Performance

Base Model	GQA	AI2D	ChartQA	DocVQA_val	MME
clip-vit-large-patch14-336	61.50	56.28	18.32	24.86	1668.41
dfn5B-clip-vit-h-14-378	62.70	56.87	16.40	23.09	1665.35
siglip-so400m-patch14-384 (Our Implementation)	62.92	57.12	22.44	31.32	1667.92

A more detailed analysis can be found in our paper.
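To make the comparison in the table easier to read, the short sketch below finds the best base model per benchmark and the margin of the SigLIP-based encoder over the CLIP baseline. The numbers are copied from the table, the helper name is hypothetical, and the code assumes that higher is better on every benchmark.

```python
# Scores copied from the table above (assumed: higher is better everywhere).
scores = {
    "clip-vit-large-patch14-336": {
        "GQA": 61.50, "AI2D": 56.28, "ChartQA": 18.32, "DocVQA_val": 24.86, "MME": 1668.41,
    },
    "dfn5B-clip-vit-h-14-378": {
        "GQA": 62.70, "AI2D": 56.87, "ChartQA": 16.40, "DocVQA_val": 23.09, "MME": 1665.35,
    },
    "siglip-so400m-patch14-384": {
        "GQA": 62.92, "AI2D": 57.12, "ChartQA": 22.44, "DocVQA_val": 31.32, "MME": 1667.92,
    },
}


def best_per_benchmark(table):
    """Map each benchmark name to the model with the highest score."""
    benchmarks = next(iter(table.values())).keys()
    return {b: max(table, key=lambda m: table[m][b]) for b in benchmarks}


best = best_per_benchmark(scores)
siglip = scores["siglip-so400m-patch14-384"]
clip = scores["clip-vit-large-patch14-336"]
margins = {b: round(siglip[b] - clip[b], 2) for b in siglip}

for benchmark, model in best.items():
    print(f"{benchmark}: best = {model}, SigLIP minus CLIP = {margins[benchmark]}")
```

The SigLIP-based encoder leads on four of the five benchmarks, with the widest gaps on DocVQA and ChartQA, while the CLIP baseline is marginally ahead on MME.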

📚 Documentation

Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Boqiang Zhang and Kehan Li and Zesen Cheng and Zhiqiang Hu and Yuqian Yuan and Guanzheng Chen and Sicong Leng and Yuming Jiang and Hang Zhang and Xin Li and Peng Jin and Wenqi Zhang and Fan Wang and Lidong Bing and Deli Zhao},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}

📄 License

This project is licensed under the Apache-2.0 license.
