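🚀 Improved SigLIP Vision Model
This project improves on https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2
It addresses the original model's limitations in image resolution and in handling images of different sizes, which makes the vision model more flexible and more widely applicable.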
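🚀 Quick Start
This model is a modified version of https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2
with two main changes:
- The maximum resolution is raised to 980 x 980 (from 384 x 384 in the original model) by interpolating the position embeddings.
- The strategy from NaViT is adopted to support (a) images of variable resolution and (b) images that keep their original aspect ratio.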
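To make the first change concrete, here is a minimal sketch of resizing a square grid of position embeddings from the 384-pixel grid to the 980-pixel grid. It is written in plain Python so it runs without a GPU. The function name `interpolate_grid`, the toy one-dimensional embeddings, and the choice of bilinear, corner-aligned interpolation are all assumptions made for illustration; the model card does not say which interpolation mode was actually used, and in practice one would more likely call `torch.nn.functional.interpolate` on the real embedding table.

```python
PATCH_SIZE = 14


def interpolate_grid(grid, new_h, new_w):
    """Bilinearly resize grid[h][w] (each cell an embedding vector) to new_h x new_w.

    Hypothetical helper: corner-aligned sampling, so the four corner
    vectors of the source grid are reproduced exactly.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(new_h):
        # Map the output row index to a fractional source row.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            # Map the output column index to a fractional source column.
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            vec = []
            for d in range(dim):
                top = grid[y0][x0][d] * (1 - wx) + grid[y0][x1][d] * wx
                bottom = grid[y1][x0][d] * (1 - wx) + grid[y1][x1][d] * wx
                vec.append(top * (1 - wy) + bottom * wy)
            row.append(vec)
        out.append(row)
    return out


old_side = 384 // PATCH_SIZE  # 27 patches per side
new_side = 980 // PATCH_SIZE  # 70 patches per side

# Toy embeddings of dimension 1 that encode the normalised row position.
old_grid = [[[r / (old_side - 1)] for _ in range(old_side)] for r in range(old_side)]
new_grid = interpolate_grid(old_grid, new_side, new_side)

print(old_side * old_side, "->", new_side * new_side, "positions")
```

With a patch size of 14, the grid grows from 27 x 27 to 70 x 70 positions, and the interpolated embeddings still vary smoothly from the top row to the bottom row.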
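These changes apply only to the vision tower; the text tower is unchanged. The implementation is fully backward compatible with https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2
To get the original behavior, simply do not pass patch_attention_mask.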
💻 Usage Example
Basic usage
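import torch
# Assumes modeling_siglip.py from the model repository is on the import path.
from modeling_siglip import SiglipVisionModel

DEVICE = torch.device("cuda:0")
PATCH_SIZE = 14

# Batch of 2 images padded to 28 x 42 pixels (height x width), i.e. a 2 x 3 grid of patches.
pixel_values = torch.randn(2, 3, 28, 42, dtype=torch.bfloat16, device=DEVICE)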
# Pixel-level mask: 1 marks a real pixel, 0 marks padding.
pixel_attention_mask = [
    # Image 0: the top 14 pixel rows are real, the bottom 14 rows are padding.
    [
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [1] * 14 + [1] * 14 + [1] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
        [0] * 14 + [0] * 14 + [0] * 14,
    ],
    # Image 1: the left 28 pixel columns are real, the right 14 columns are padding.
    [
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
        [1] * 14 + [1] * 14 + [0] * 14,
    ],
]
pixel_attention_mask = torch.tensor(pixel_attention_mask, dtype=torch.bool, device=DEVICE)

# Cut the pixel mask into PATCH_SIZE x PATCH_SIZE tiles: (2, 28, 42) -> (2, 2, 3, 14, 14).
patches_subgrid = pixel_attention_mask.unfold(
    dimension=1, size=PATCH_SIZE, step=PATCH_SIZE
).unfold(dimension=2, size=PATCH_SIZE, step=PATCH_SIZE)
# A patch is attended to if it contains at least one real pixel; the result has shape (2, 2, 3).
patch_attention_mask = (patches_subgrid.sum(dim=(-1, -2)) > 0).bool()

model = SiglipVisionModel.from_pretrained(
    "HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit", _flash_attn_2_enabled=True
)
model.train()
model.vision_model.to(DEVICE, dtype=torch.bfloat16)

output = model.vision_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
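The hand-written mask in the example can also be generated from each image's real size. The sketch below does this in plain Python and then reduces the pixel mask to a patch mask using the same rule as the example (a patch counts as attended if any of its pixels is real). The helper names `make_pixel_mask` and `to_patch_mask` are hypothetical, and the sketch assumes the real content sits in the top-left corner of the padded canvas, as it does in the example.

```python
PATCH_SIZE = 14


def make_pixel_mask(real_h, real_w, padded_h, padded_w):
    """Return a padded_h x padded_w mask: 1 for real pixels (top-left), 0 for padding."""
    return [
        [1 if (r < real_h and c < real_w) else 0 for c in range(padded_w)]
        for r in range(padded_h)
    ]


def to_patch_mask(pixel_mask, patch_size=PATCH_SIZE):
    """Mark a patch as attended (True) if any pixel inside it is real."""
    n_rows = len(pixel_mask) // patch_size
    n_cols = len(pixel_mask[0]) // patch_size
    return [
        [
            any(
                pixel_mask[pr * patch_size + dr][pc * patch_size + dc]
                for dr in range(patch_size)
                for dc in range(patch_size)
            )
            for pc in range(n_cols)
        ]
        for pr in range(n_rows)
    ]


# The same two images as in the example, both padded to 28 x 42 pixels.
masks = [
    make_pixel_mask(14, 42, 28, 42),  # wide image: 14 x 42 real pixels
    make_pixel_mask(28, 28, 28, 42),  # square image: 28 x 28 real pixels
]
patch_masks = [to_patch_mask(m) for m in masks]

print(patch_masks[0])  # [[True, True, True], [False, False, False]]
print(patch_masks[1])  # [[True, True, False], [True, True, False]]
```

Converting `masks` with `torch.tensor(..., dtype=torch.bool)` should give the same pixel_attention_mask as the literal in the example.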
📄 License
This project is licensed under the Apache-2.0 license.