# 🚀 Siglip Vision Model Enhancement
This project is an enhanced version of https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2, with targeted improvements to the vision tower while keeping the text tower unchanged. The enhancements enable better handling of variable-resolution and aspect-ratio-preserved images.
## 🚀 Quick Start
This model is an improved version of https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2, with two main changes:
- Increases the maximum resolution to 980×980 (up from 384×384) by interpolating the position embeddings (a minimal sketch of this interpolation follows at the end of this section).
- Implements the strategy introduced in NaViT to support (a) variable-resolution images and (b) aspect-ratio-preserved images.
These changes apply only to the vision tower; the text tower is unchanged. The implementation is fully backward compatible with https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2: simply omit the `patch_attention_mask` argument (a backward-compatible call is shown at the end of the usage example below).
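For illustration, here is a minimal sketch of the position-embedding interpolation mentioned above, assuming bicubic resampling of the learned embedding grid. The function name `interpolate_pos_embed`, the grid sizes (27 = 384 // 14, 70 = 980 // 14), and the hidden size of 1152 (the so400m width) are our assumptions for the sketch, not code taken from this repository.

```python
# Hypothetical sketch of position-embedding interpolation (not the repo's exact code).
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize a (old_grid * old_grid, dim) learned position embedding to a larger grid."""
    dim = pos_embed.shape[-1]
    # Lay the flat embedding out as a channels-first 2D grid for F.interpolate.
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # Bicubic interpolation to the new spatial resolution.
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    # Back to a flat (new_grid * new_grid, dim) embedding table.
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)


# e.g. from the 384-px checkpoint (384 // 14 = 27 patches per side)
# to the 980-px model (980 // 14 = 70 patches per side):
old = torch.randn(27 * 27, 1152)
new = interpolate_pos_embed(old, old_grid=27, new_grid=70)
print(new.shape)  # torch.Size([4900, 1152])
```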
## 💻 Usage Examples

### Basic Usage
```python
import torch
from modeling_siglip import SiglipVisionModel

DEVICE = torch.device("cuda:0")
PATCH_SIZE = 14

# Two images packed onto a common 28x42 canvas; regions outside each real
# image are padding.
pixel_values = torch.randn(2, 3, 28, 42, dtype=torch.bfloat16, device=DEVICE)

# Pixel-level masks: 1 marks real pixels, 0 marks padding.
pixel_attention_mask = [
    # Image 1 fills the top 14x42 region; the bottom 14 rows are padding.
    [[1] * 42] * 14 + [[0] * 42] * 14,
    # Image 2 fills the left 28x28 region; the right 14 columns are padding.
    [[1] * 28 + [0] * 14] * 28,
]
pixel_attention_mask = torch.tensor(pixel_attention_mask, dtype=torch.bool, device=DEVICE)

# Reduce the pixel-level mask to a patch-level mask: a patch participates in
# attention if it contains at least one real pixel.
patches_subgrid = pixel_attention_mask.unfold(
    dimension=1, size=PATCH_SIZE, step=PATCH_SIZE
).unfold(dimension=2, size=PATCH_SIZE, step=PATCH_SIZE)
patch_attention_mask = (patches_subgrid.sum(dim=(-1, -2)) > 0).bool()

model = SiglipVisionModel.from_pretrained(
    "HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit",
    _flash_attn_2_enabled=True,  # enable the Flash Attention 2 code path
)
model.train()
model.vision_model.to(DEVICE, dtype=torch.bfloat16)

output = model.vision_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
```
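Two hedged follow-up notes on the example above: the 28×42 canvas yields 2 × 3 = 6 patch tokens per image, and the backward-compatible call simply omits `patch_attention_mask`, in which case every patch is treated as real. The exact `last_hidden_state` shape shown below assumes the so400m hidden size of 1152.

```python
# Patch-level features: one token per 14x14 patch of the 28x42 canvas
# (2 x 3 = 6 patches per image); hidden size assumes the so400m width.
print(output.last_hidden_state.shape)  # torch.Size([2, 6, 1152])

# Backward-compatible call: omitting patch_attention_mask treats every patch
# as a real (non-padding) patch, as with the original 384-px model.
output_no_mask = model.vision_model(pixel_values=pixel_values)
```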
## 📄 License
This project is licensed under the Apache-2.0 license.