so400m-long: Open-Source Vision-Language Model Fine-Tuned from SigLIP 2 with Improved Long-Text Handling

So400m Long

Developed by fancyfeast

A vision-language model fine-tuned from SigLIP 2, with the maximum text length increased from 64 to 256 tokens.

Text-to-Image

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Long-text visual matching #Multimodal embedding #Gallery tag enhancement

Downloads 27

Release Time: 4/14/2025

Model Overview

This model is a fine-tuned version of SigLIP 2. The fine-tuning extends the context length and adapts the model to more types of text while preserving the original embedding space.

Model Features

Extended Context Length

Maximum text length increased from 64 tokens in the base model to 256 tokens

Preserved Original Features

The vision tower, logit_scale, logit_bias, and the text tower's head are kept frozen so that the original embedding space is retained

Multi-type Text Adaptation

Training data includes various image-text combinations such as descriptive captions, gallery tags, and prompts

Model Capabilities

Image-text matching

Cross-modal retrieval

Text ranking (with a tendency to prefer shorter text)

Multi-type text processing

Use Cases

Content Retrieval

Gallery Tag Matching

Match relevant tag lists based on image content

Recognition of tag lists for photorealistic images is still imperfect

Multimodal Applications

Image-Text Pair Scoring

Score or rank candidate descriptive text or prompts for images

Tends to favor shorter text descriptions

🚀 Finetune of SigLIP 2 So400m for Long Context

This model is finetuned from SigLIP 2, functioning the same as the base model but with an extended maximum text length of 256 tokens (compared to 64 in the base model).

🚀 Quick Start

Use this model exactly as you would the base SigLIP 2 model. The only difference is that text inputs can now be up to 256 tokens long, compared to 64 in the base model.
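The following is a minimal sketch of scoring texts against an image, not an official usage snippet. It assumes the checkpoint is published as `fancyfeast/so400m-long` on the Hugging Face Hub and that it loads through the same `AutoModel` / `AutoProcessor` path as the base SigLIP 2 checkpoints. The helper name `score_texts` and the image path in the usage note are hypothetical.

```python
"""Sketch: score candidate texts against one image with so400m-long."""

MODEL_ID = "fancyfeast/so400m-long"  # assumed Hugging Face repo id
MAX_TEXT_LENGTH = 256                # from the model card (base model: 64)


def score_texts(image_path, texts, model_id=MODEL_ID):
    """Return one sigmoid match probability per text for the given image."""
    # Heavy dependencies are imported lazily so that this module can be
    # imported without them installed.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id).eval()
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=texts,
        images=image,
        padding="max_length",        # SigLIP models expect fixed-length padding
        truncation=True,
        max_length=MAX_TEXT_LENGTH,  # 256 instead of the base model's 64
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_texts); SigLIP scores each
    # image-text pair independently, so sigmoid is used rather than softmax.
    return torch.sigmoid(outputs.logits_per_image)[0].tolist()
```

A call such as `score_texts("example.jpg", ["a short caption", "a much longer, detailed caption ..."])` would return one probability per text. Since the model tends to prefer shorter text, it may be worth comparing candidates of similar length.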

✨ Features

Training Settings

Training Samples: 10,000,000
Warmup Samples: 1,000,000
Batch Size: 256
Learning Rate: 4e-4
Schedule: Cosine
AMP: bfloat16
Model Weights: float32
Optimizer: AdamW
Weight Decay: 0.2
Clip Grad Norm: 1.0
Maximum Token Length: 256

These settings are by no means optimal. The SigLIP paper suggests that Weight Decay is bad for finetuning SigLIP models, and of course these types of models tend to benefit from large batch sizes. I merely used some defaults from older code.
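To make the sample-based settings concrete, here is a rough sketch of how they translate into optimizer steps and a learning-rate curve. The card only says "Warmup Samples" and "Cosine", so the linear warmup shape and the decay to zero are assumptions, as is the function name `learning_rate`.

```python
import math

# Values taken from the training settings above.
TRAINING_SAMPLES = 10_000_000
WARMUP_SAMPLES = 1_000_000
BATCH_SIZE = 256
PEAK_LR = 4e-4

# Convert sample counts to optimizer steps.
TOTAL_STEPS = TRAINING_SAMPLES // BATCH_SIZE  # 39062 steps
WARMUP_STEPS = WARMUP_SAMPLES // BATCH_SIZE   # 3906 steps


def learning_rate(step):
    """Learning rate at a given 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: cosine decay from the peak rate down to 0.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

Under these assumptions the rate peaks at 4e-4 after roughly 3.9K steps and reaches zero at the end of the roughly 39K-step run.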

Performance on Test Set

On a test set of 16K samples, the model starts at a loss of 17.65 and finishes at a loss of 2.51.
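The card does not say which loss these numbers refer to. Assuming it is the pairwise sigmoid loss from the SigLIP paper, summed over all image-text pairs in a batch and divided by the batch size, it can be sketched in plain Python as below. The function names are illustrative, and `logit_scale` here means the already-exponentiated temperature.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_loss(image_embs, text_embs, logit_scale, logit_bias):
    """Pairwise sigmoid loss for a batch where image i matches text i.

    Embeddings are lists of L2-normalized vectors. Every (image, text)
    pair is treated as an independent binary classification problem.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = logit_scale * similarity + logit_bias
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 otherwise
            total -= log_sigmoid(label * logit)
    return total / n
```

Because the sum runs over all pairs in the batch, the value depends on the batch size, so loss figures are only comparable between runs that use the same evaluation batch size.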

Dataset

The dataset used consists of about 1.2M text-image pairs with data from a variety of sources. About 250k examples are random CommonCrawl image-alt text pairs, which should best match so400m's original training data. The remainder of the examples are from the JoyCaption dataset, which contains a wide variety of image types and paired text such as descriptive captions, booru tag lists, stable diffusion prompts, and VQA.

Training Strategy

During training, the vision tower was kept completely frozen, along with logit_scale, logit_bias, and the text tower's head. The rest of the text tower was left unfrozen. This helps ensure that the finetuning process preserves the original embedding space and focuses only on upgrading the context length and the types of text handled.
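The freezing scheme described above can be sketched as a name-based filter over model parameters. This is not the author's training code. The parameter names (`vision_model.`, `text_model.head.`, `logit_scale`, `logit_bias`) are assumed to follow the Hugging Face `transformers` SigLIP implementation, and `is_trainable` and `apply_freezing` are hypothetical helpers.

```python
# Assumed parameter naming, following the Hugging Face SigLIP model classes.
FROZEN_PREFIXES = (
    "vision_model.",     # the entire vision tower
    "text_model.head.",  # the text tower's output head
)
FROZEN_NAMES = ("logit_scale", "logit_bias")


def is_trainable(param_name):
    """Return True if the named parameter should receive gradients."""
    if param_name in FROZEN_NAMES:
        return False
    return not param_name.startswith(FROZEN_PREFIXES)


def apply_freezing(model):
    """Set requires_grad on every parameter and return the trainable names.

    Works with any object exposing named_parameters(), such as a PyTorch
    nn.Module.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With this filter only the text tower's embeddings, encoder layers, and final layer norm remain trainable, so image embeddings are unchanged from the base model.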

Position Embeddings

The position embeddings were expanded by leaving the original 64 embeddings intact in their original positions, while initializing the new positions randomly. No ablations were performed to determine if this is the optimal approach. However, I noted during experimentation that the model is fairly insensitive to the position embeddings.
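As an illustration of the expansion described above, the sketch below copies the original rows and appends randomly initialized ones. Plain Python lists stand in for the embedding tensor so the example has no dependencies. The card does not state the random initialization used, so the normal distribution with standard deviation 0.02 is an assumption, and `expand_position_embeddings` is a hypothetical helper.

```python
import random

OLD_POSITIONS = 64   # maximum text length of the base model
NEW_POSITIONS = 256  # maximum text length after finetuning


def expand_position_embeddings(old_table, new_len=NEW_POSITIONS, std=0.02, seed=0):
    """Return a table with new_len rows.

    The first len(old_table) rows are copies of the original embeddings,
    kept at their original positions. The remaining rows are drawn from
    a normal distribution (assumed: mean 0, standard deviation `std`).
    """
    if new_len < len(old_table):
        raise ValueError("new_len must be at least the current number of positions")
    rng = random.Random(seed)
    dim = len(old_table[0])
    expanded = [list(row) for row in old_table]  # originals stay intact
    for _ in range(new_len - len(old_table)):
        expanded.append([rng.gauss(0.0, std) for _ in range(dim)])
    return expanded
```

In a real model the same idea would be applied to the text tower's position embedding weight, and the text configuration's maximum position count would be updated to match.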

Practical Performance

In practice, I've found that this model performs slightly better than the base SigLIP 2 so400m, but tends to prefer shorter text. That is, given two texts that both perfectly describe the image, the model will tend to weight the shorter of the two higher. The model's ability to recognize booru tag lists for photorealistic images is also imperfect.

📚 Documentation

Credits

Credits to the SigLIP 2 team for their amazing work on improving an already great model.

BibTeX entry and citation info

@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features}, 
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786}, 
}

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Library Name	transformers
Tags	vision
License	apache-2.0
Base Model	google/siglip2-so400m-patch14-384
