# 🚀 Siglip Vision Model Enhancement
This project is an enhanced version of https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2, with targeted improvements to the vision tower while keeping the text tower unchanged. The enhancements enable better handling of variable-resolution and aspect-ratio-preserved images.
## 🚀 Quick Start
This model is an improved version of https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2, with two main changes:
- Increases the maximum resolution to 980×980 (up from 384×384) by interpolating the position embeddings (a minimal sketch of this interpolation follows at the end of this section).
- Implements the strategy introduced in NaViT to support (a) variable-resolution images and (b) aspect-ratio-preserved images.
These changes apply only to the vision tower; the text tower is unchanged. The implementation is fully backward compatible with https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2: simply omit the `patch_attention_mask` argument (a backward-compatible call is shown at the end of the usage example below).
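For illustration, here is a minimal sketch of the position-embedding interpolation mentioned above, assuming bicubic resampling of the learned embedding grid. The function name `interpolate_pos_embed`, the grid sizes (27 = 384 // 14, 70 = 980 // 14), and the hidden size of 1152 (the so400m width) are our assumptions for the sketch, not code taken from this repository.

```python
# Hypothetical sketch of position-embedding interpolation (not the repo's exact code).
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize a (old_grid * old_grid, dim) learned position embedding to a larger grid."""
    dim = pos_embed.shape[-1]
    # Lay the flat embedding out as a channels-first 2D grid for F.interpolate.
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # Bicubic interpolation to the new spatial resolution.
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    # Back to a flat (new_grid * new_grid, dim) embedding table.
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)


# e.g. from the 384-px checkpoint (384 // 14 = 27 patches per side)
# to the 980-px model (980 // 14 = 70 patches per side):
old = torch.randn(27 * 27, 1152)
new = interpolate_pos_embed(old, old_grid=27, new_grid=70)
print(new.shape)  # torch.Size([4900, 1152])
```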
## 💻 Usage Examples

### Basic Usage
```python
import torch
from modeling_siglip import SiglipVisionModel

DEVICE = torch.device("cuda:0")
PATCH_SIZE = 14

# Two images packed onto a common 28x42 canvas; regions outside each real
# image are padding.
pixel_values = torch.randn(2, 3, 28, 42, dtype=torch.bfloat16, device=DEVICE)

# Pixel-level masks: 1 marks real pixels, 0 marks padding.
pixel_attention_mask = [
    # Image 1 fills the top 14x42 region; the bottom 14 rows are padding.
    [[1] * 42] * 14 + [[0] * 42] * 14,
    # Image 2 fills the left 28x28 region; the right 14 columns are padding.
    [[1] * 28 + [0] * 14] * 28,
]
pixel_attention_mask = torch.tensor(pixel_attention_mask, dtype=torch.bool, device=DEVICE)

# Reduce the pixel-level mask to a patch-level mask: a patch participates in
# attention if it contains at least one real pixel.
patches_subgrid = pixel_attention_mask.unfold(
    dimension=1, size=PATCH_SIZE, step=PATCH_SIZE
).unfold(dimension=2, size=PATCH_SIZE, step=PATCH_SIZE)
patch_attention_mask = (patches_subgrid.sum(dim=(-1, -2)) > 0).bool()

model = SiglipVisionModel.from_pretrained(
    "HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit",
    _flash_attn_2_enabled=True,  # enable the Flash Attention 2 code path
)
model.train()
model.vision_model.to(DEVICE, dtype=torch.bfloat16)

output = model.vision_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
```
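Two hedged follow-up notes on the example above: the 28×42 canvas yields 2 × 3 = 6 patch tokens per image, and the backward-compatible call simply omits `patch_attention_mask`, in which case every patch is treated as real. The exact `last_hidden_state` shape shown below assumes the so400m hidden size of 1152.

```python
# Patch-level features: one token per 14x14 patch of the 28x42 canvas
# (2 x 3 = 6 patches per image); hidden size assumes the so400m width.
print(output.last_hidden_state.shape)  # torch.Size([2, 6, 1152])

# Backward-compatible call: omitting patch_attention_mask treats every patch
# as a real (non-padding) patch, as with the original 384-px model.
output_no_mask = model.vision_model(pixel_values=pixel_values)
```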
## 📄 License
This project is licensed under the Apache-2.0 license.