Web-SSL DINO ViT-3B: Heavy Filtered 2B MetaCLIP data, 224 Resolution
A 3-billion-parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data, offering strong performance across a wide range of vision tasks without language supervision.
🚀 Quick Start
To use the Web-SSL DINO 3B model, follow the example below:
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and model from the Hugging Face Hub
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224')

# Preprocess an input image
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# CLS token (global image representation) and per-patch token embeddings
cls_features = outputs.last_hidden_state[:, 0]
patch_features = outputs.last_hidden_state[:, 1:]
```
✨ Features
- Self-supervised Learning: Trained with self-supervised learning on heavily filtered web images, without language supervision.
- Focused Filtering: The "heavy2b" filtering selects images containing charts, tables, and documents, enhancing OCR and chart understanding capabilities.
- Strong Performance: Matches or exceeds language-supervised models such as CLIP across a range of vision tasks.
📦 Installation
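The model is distributed through the Hugging Face Hub, so the snippets in this card only require the `transformers` library together with `torch` and `Pillow`, e.g. `pip install transformers torch pillow`.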
💻 Usage Examples
Basic Usage
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224')

image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Global (CLS) and per-patch token embeddings
cls_features = outputs.last_hidden_state[:, 0]     # (batch, hidden_size)
patch_features = outputs.last_hidden_state[:, 1:]  # (batch, num_patches, hidden_size)
```
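For dense tasks such as segmentation, the patch tokens can be rearranged into a 2D feature map. The following is a minimal sketch that reuses `patch_features` from the example above; it infers the grid size from the number of patch tokens rather than assuming a particular patch size, and assumes a square input image.

```python
import math

# Recover the spatial grid of patch embeddings from the flat token sequence.
num_patches = patch_features.shape[1]
grid_size = int(math.isqrt(num_patches))
assert grid_size * grid_size == num_patches, "expected a square patch grid"

# (batch, num_patches, hidden) -> (batch, hidden, grid, grid), a CNN-style feature map
feature_map = patch_features.reshape(
    patch_features.shape[0], grid_size, grid_size, -1
).permute(0, 3, 1, 2)
print(feature_map.shape)
```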
Advanced Usage
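The original card does not include an advanced example; the following is a minimal sketch of one common use of a language-free encoder: comparing two images by the cosine similarity of their L2-normalized CLS embeddings. The file paths are placeholders, and the `embed` helper is illustrative rather than part of the model's API.

```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224').to(device).eval()

def embed(path):
    """Return an L2-normalized CLS embedding for one image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(cls, dim=-1)

# Placeholder paths; replace with real images
sim = (embed('path/to/image_a.jpg') * embed('path/to/image_b.jpg')).sum(dim=-1)
print(f"cosine similarity: {sim.item():.3f}")
```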
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Architecture | ViT (3072 width, 26 depth, 24 heads) |
| Parameters | 3B |
| Resolution | 224×224 pixels |
| Training | Self-supervised Web-DINO on heavily filtered MetaCLIP data |
Model Description
Web-SSL DINO 3B is a 3-billion-parameter Vision Transformer trained with self-supervised learning on heavily filtered web images, without language supervision. The "heavy2b" designation indicates training on a subset of images containing charts, tables, and documents with readable text, representing only 1.3% of the original MetaCLIP dataset. This focused filtering significantly improves OCR and chart understanding while maintaining strong performance on other vision tasks. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed language-supervised models such as CLIP across a range of vision tasks.

🔧 Technical Details
The model was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025). It is a Vision Transformer (ViT) with 3 billion parameters, trained with DINOv2 self-supervised learning on heavily filtered web-scale image data, without language supervision.
📄 License
The model is licensed under cc-by-nc-4.0.
📖 Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```