Web-SSL MAE ViT-H (700M): 2B MetaCLIP data, 224 resolution
This is a Vision Transformer (ViT-H) with 700 million parameters, trained with Masked Autoencoder (MAE) self-supervised learning on web-scale image data without language supervision. It was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
Quick Start
The following Python snippet loads the model and extracts image features:
```python
from transformers import AutoImageProcessor, ViTModel
import torch
from PIL import Image

# Load the image processor and the Web-SSL MAE ViT-H encoder.
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae700m-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae700m-full2b-224').cuda().eval()

# Preprocess an image and move the tensors to the GPU.
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt").to('cuda')

# Run the encoder without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

encoder_hidden_states = outputs.last_hidden_state
```
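When loaded through `ViTModel`, `last_hidden_state` has shape `(batch_size, num_tokens, hidden_size)`; the first token is the class token and the remaining tokens correspond to image patches.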
Features
- No Language Supervision: Trained on web-scale image data without language supervision, demonstrating that pure visual learning can achieve excellent performance.
- Strong Performance: Performs particularly well on OCR & Chart understanding tasks and maintains competitive performance across traditional vision benchmarks and multimodal tasks.
Installation
The model runs with the standard Hugging Face stack; installing `transformers`, `torch`, and `pillow` (e.g. `pip install torch transformers pillow`) is sufficient for the snippets in this card.
Usage Examples
Basic Usage
The basic usage is identical to the Quick Start snippet above: load the processor and model, preprocess an image, and take `outputs.last_hidden_state` as the patch-level features.
Advanced Usage
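The original card does not include an advanced example. Below is a minimal sketch, not taken from the official card, of one way to turn the encoder output into a single image-level embedding for retrieval or similarity search. It assumes mean pooling over patch tokens is a reasonable readout for this checkpoint; the helper name `image_embedding` and the image paths are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

MODEL_ID = "facebook/webssl-mae700m-full2b-224"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = ViTModel.from_pretrained(MODEL_ID).to(device).eval()

# Hypothetical helper (not part of the official card): mean-pool the patch
# tokens (dropping the class token at index 0) into one embedding per image.
def image_embedding(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state   # (1, num_tokens, hidden_size)
    feat = tokens[:, 1:, :].mean(dim=1)              # average over patch tokens
    return F.normalize(feat, dim=-1)                 # unit norm for cosine similarity

# Compare two images by the cosine similarity of their pooled embeddings.
emb_a = image_embedding("path/to/image_a.jpg")
emb_b = image_embedding("path/to/image_b.jpg")
print("cosine similarity:", (emb_a @ emb_b.T).item())
```

Using the class token (`last_hidden_state[:, 0]`) instead of mean pooling is an equally common readout; which works better for this checkpoint depends on the downstream task.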
Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | ViT-H (Huge) |
| Parameters | 700M |
| Resolution | 224×224 pixels |
| Training Data | Self-supervised Web-MAE on 2B image samples from MetaCLIP web data |
Model Description
Web-SSL MAE ViT-H is a 700 million parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images, without language supervision. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks. Web-MAE performs particularly strongly on OCR & Chart understanding tasks while remaining competitive across traditional vision benchmarks and multimodal tasks.
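As a quick sanity check on the parameter count quoted above, you can load the checkpoint and sum the parameter sizes. This is an illustrative snippet rather than part of the original card, and the exact total may differ slightly depending on whether the optional pooling head is instantiated.

```python
from transformers import ViTModel

# Load the checkpoint and count parameters; the total should be on the
# order of 700M, matching the size quoted in the table above.
model = ViTModel.from_pretrained("facebook/webssl-mae700m-full2b-224")
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.0f}M")
```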

Technical Details
For in-depth training details, refer to "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
License
The model is licensed under CC BY-NC 4.0.
Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```