Web-SSL DINO ViT-3B: Light-Filtered 2B MetaCLIP Data, 224 Resolution
This is a Vision Transformer (ViT) with 3 billion parameters, trained with DINOv2 self-supervised learning on lightly filtered web-scale image data without language supervision. It was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
Quick Start
Web-SSL DINO ViT-3B is a high-capacity vision model. To get started, follow the steps in the "Usage Examples" section below.
⨠Features
- High-capacity model: With 3 billion parameters, it can capture rich visual information.
- Self-supervised learning: Trained without language supervision, relying on self-supervised learning on web-scale image data.
- Filtered training data: The "light2b" filtering improves OCR and chart understanding capabilities while maintaining strong performance across all vision tasks.
Installation
The original model card does not provide dedicated installation steps. The usage example below assumes the transformers, torch, and Pillow packages are installed.
Usage Examples
Basic Usage
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and the Web-SSL DINO ViT-3B backbone
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-light2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-light2b-224')

# Preprocess an input image (resized and normalized to 224x224 by the processor)
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

# Forward pass without gradients for feature extraction
with torch.no_grad():
    outputs = model(**inputs)

# CLS token embedding and per-patch embeddings
cls_features = outputs.last_hidden_state[:, 0]      # shape: (1, hidden_size)
patch_features = outputs.last_hidden_state[:, 1:]   # shape: (1, num_patches, hidden_size)
```
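For this model the hidden size is 3072 (see the Model Details table), so cls_features has shape (1, 3072). The number of patch tokens depends on the checkpoint's patch size; for example, a 14×14 patch size at 224×224 resolution would yield 256 patch tokens.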
Advanced Usage
The original model card provides no advanced usage code; one possible extension is sketched below.
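As a minimal sketch (not from the original model card), the snippet below batches several images through the same checkpoint and compares them with cosine similarity over mean-pooled patch features. The image paths are placeholders, and mean pooling is just one reasonable choice of image-level descriptor.

```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
import torch.nn.functional as F
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-light2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-light2b-224')
model.eval()

# Hypothetical image paths; replace with real files
paths = ['image_a.jpg', 'image_b.jpg', 'image_c.jpg']
images = [Image.open(p).convert('RGB') for p in paths]

# The processor batches and normalizes all images in one call
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the patch tokens (dropping the CLS token) into one vector per image
pooled = outputs.last_hidden_state[:, 1:].mean(dim=1)   # (batch, hidden_size)
pooled = F.normalize(pooled, dim=-1)

# Pairwise cosine similarities between the images
similarity = pooled @ pooled.T
print(similarity)
```

Using the CLS token (outputs.last_hidden_state[:, 0]) instead of mean-pooled patch tokens is an equally common choice for an image-level descriptor.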
Documentation
Model Details
| Property | Details |
|----------|---------|
| Architecture | ViT (3072 width, 26 depth, 24 heads) |
| Parameters | 3B |
| Resolution | 224×224 pixels |
| Training | Self-supervised Web-DINO on lightly filtered MetaCLIP data |
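These properties can be checked directly from the published configuration. The sketch below assumes the checkpoint exposes a standard Dinov2 configuration through transformers; attribute names may differ for other architectures.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('facebook/webssl-dino3b-light2b-224')

# Width, depth, and number of attention heads reported in the table above
print(config.hidden_size)          # expected: 3072
print(config.num_hidden_layers)    # expected: 26
print(config.num_attention_heads)  # expected: 24
print(config.image_size)           # expected: 224
```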
Model Descriptions
Web-SSL DINO 3B is a Vision Transformer model with 3 billion parameters. It is trained using self-supervised learning on lightly filtered web images without language supervision. The "light2b" designation indicates training on a subset of images containing any textual content, retaining approximately 50.3% of the original MetaCLIP dataset. This filtering improves OCR and chart understanding capabilities while maintaining strong performance across all vision tasks. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks.

License
The model is released under the cc-by-nc-4.0 license.
Technical Details
The model is introduced in the paper "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025), which shows that self-supervised learning on web-scale image data is effective for vision tasks without language supervision.
Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```