🌐 Web-SSL DINO ViT-1B: 2B MetaCLIP data, 224 Resolution
A 1 billion parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data without language supervision.
🚀 Quick Start
Web-SSL DINO 1B is a powerful Vision Transformer (ViT) trained with self-supervised learning on 2 billion web images, without any language supervision. The model shows that properly scaled, purely visual learning can match or outperform language-supervised models like CLIP on various vision tasks.
✨ Features
- High-capacity Model: With 1 billion parameters, it can capture complex visual patterns.
- Self-supervised Learning: Trained with Web-DINO on a large web image dataset, without language supervision.
- Versatile Performance: Can match or exceed the performance of language-supervised models on multiple vision tasks.
📦 Installation
No specific installation steps were provided in the original document. The usage example below relies on the transformers, torch, and Pillow packages, which can be installed as shown next.
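A typical setup, assuming a standard pip environment:

```bash
pip install transformers torch pillow
```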
💻 Usage Examples
Basic Usage
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and the 1B-parameter Web-SSL DINO model.
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino1b-full2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino1b-full2b-224')

# Preprocess a single image (the path is a placeholder).
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Token 0 is the CLS embedding; the remaining tokens are patch embeddings.
cls_features = outputs.last_hidden_state[:, 0]
patch_features = outputs.last_hidden_state[:, 1:]
```
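One common use of the CLS features is measuring visual similarity between images. The sketch below is a minimal, assumed workflow (the file names 'cat1.jpg' and 'cat2.jpg' are placeholders, and the `embed` helper is introduced here for illustration):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, Dinov2Model

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino1b-full2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino1b-full2b-224')
model.eval()

def embed(path: str) -> torch.Tensor:
    """Return the L2-normalized CLS embedding for one image."""
    image = Image.open(path).convert('RGB')
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    cls = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_dim)
    return F.normalize(cls, dim=-1)

# Cosine similarity of two normalized embeddings (placeholder paths).
similarity = embed('cat1.jpg') @ embed('cat2.jpg').T
print(f"cosine similarity: {similarity.item():.3f}")
```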
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (ViT) |
| Architecture | ViT (1536 width, 40 depth, 24 heads) |
| Parameters | 1B |
| Resolution | 224×224 pixels |
| Training Data | Self-supervised Web-DINO on 2B image samples from MetaCLIP web data |
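These figures can be sanity-checked against the loaded checkpoint. The sketch below assumes the standard Dinov2Config field names; the expected values in the comments come from the table above:

```python
from transformers import Dinov2Model

model = Dinov2Model.from_pretrained('facebook/webssl-dino1b-full2b-224')
config = model.config

# Architecture fields corresponding to the table above.
print(config.hidden_size)          # expected: 1536 (width)
print(config.num_hidden_layers)    # expected: 40 (depth)
print(config.num_attention_heads)  # expected: 24 (heads)

# Total parameter count, expected to be roughly 1B.
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e9:.2f}B")
```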
Model Description
Web-SSL DINO 1B is a 1 billion parameter Vision Transformer model trained using self-supervised learning on 2 billion web images without language supervision. It demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks.

📄 License
This model is released under the cc-by-nc-4.0 license.
📖 Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```