# Web-SSL MAE ViT-1B: 2B MetaCLIP data, 224 Resolution
This is a 1-billion-parameter Vision Transformer (ViT) trained with Masked Autoencoder (MAE) self-supervised learning on web-scale image data, without language supervision. It was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
## Quick Start
Web-SSL MAE 1B can be used as an off-the-shelf ViT backbone for vision tasks. To get started, install the `transformers` library (see "Installation") and follow the code in the "Usage Examples" section below.
## Features
- Trained on 2 billion web images without language supervision, demonstrating the power of pure visual learning.
- Matches or exceeds the performance of language-supervised models such as CLIP across a range of vision tasks.
- Particularly strong on OCR and chart understanding, while remaining competitive on traditional vision benchmarks and multimodal tasks.
## Installation
Installation only requires the `transformers` library:

```bash
pip install transformers
```

The basic usage example below additionally imports `torch` and `Pillow`.
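To confirm that the library is available, a quick version check can be run. This is just a minimal sanity check, not a step from the original card:

```python
import transformers

# Print the installed transformers version to confirm the installation succeeded.
print(transformers.__version__)
```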
## Usage Examples

### Basic Usage
```python
from transformers import AutoImageProcessor, ViTModel
from PIL import Image
import torch

# Load the image processor and the pretrained encoder.
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae1b-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae1b-full2b-224').cuda().eval()

# Preprocess an image and move the tensors to the GPU.
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt").to('cuda')

# Forward pass without gradient tracking; keep the encoder token features.
with torch.no_grad():
    outputs = model(**inputs)

encoder_hidden_states = outputs.last_hidden_state
```
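For downstream use such as linear probing or retrieval, a single image-level embedding is often derived from the token features. The following is a minimal sketch that mean-pools the patch tokens from `encoder_hidden_states` above, assuming the standard ViT token layout with a leading CLS token; using mean pooling rather than the CLS token is an illustrative choice here, not a recommendation from the model card.

```python
# Continues from the basic usage example above.
# Mean-pool the patch tokens (dropping the CLS token at index 0) to get one
# embedding per image. This pooling strategy is an assumption for illustration.
patch_tokens = encoder_hidden_states[:, 1:, :]
image_embedding = patch_tokens.mean(dim=1)  # shape: (batch_size, 1536)
print(image_embedding.shape)
```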
## Documentation
### Model Details

| Property | Details |
|----------|---------|
| Architecture | ViT (1536 width, 40 depth, 24 heads) |
| Parameters | 1B |
| Resolution | 224×224 pixels |
| Training | Self-supervised Web-MAE on 2B image samples from MetaCLIP web data |
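To confirm these hyperparameters programmatically, the configuration can be inspected without downloading the weights. This is a minimal sketch assuming the checkpoint exposes a standard `ViTConfig`; the attribute names follow the transformers ViT API, and the expected values are those from the table above.

```python
from transformers import AutoConfig

# Fetch only the model configuration (no weights are downloaded).
config = AutoConfig.from_pretrained('facebook/webssl-mae1b-full2b-224')

print(config.hidden_size)          # expected: 1536 (width)
print(config.num_hidden_layers)    # expected: 40 (depth)
print(config.num_attention_heads)  # expected: 24 (heads)
print(config.image_size)           # expected: 224
```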
### Model Description

Web-SSL MAE 1B is a 1-billion-parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images, without language supervision. It shows that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models such as CLIP across a range of vision tasks. Web-MAE is especially strong on OCR and chart understanding, and remains competitive on traditional vision benchmarks and multimodal tasks.

## License

This project is licensed under the cc-by-nc-4.0 license.
## Citation

```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```