# Web-SSL MAE ViT-1B: 2B MetaCLIP data, 224 Resolution
This is a 1-billion-parameter Vision Transformer (ViT) trained with Masked Autoencoder (MAE) self-supervised learning on web-scale image data, without language supervision. It was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
## Quick Start
Web-SSL MAE 1B can be used as an off-the-shelf ViT backbone for vision tasks. To get started, install the `transformers` library (see "Installation") and follow the code in the "Usage Examples" section below.
## Features
- Trained on 2 billion web images without language supervision, demonstrating the power of pure visual learning.
- Matches or exceeds the performance of language-supervised models such as CLIP across a range of vision tasks.
- Particularly strong on OCR and chart understanding, while remaining competitive on traditional vision benchmarks and multimodal tasks.
## Installation
Installation only requires the `transformers` library:

```bash
pip install transformers
```

The basic usage example below additionally imports `torch` and `Pillow`.
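To confirm that the library is available, a quick version check can be run. This is just a minimal sanity check, not a step from the original card:

```python
import transformers

# Print the installed transformers version to confirm the installation succeeded.
print(transformers.__version__)
```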
## Usage Examples

### Basic Usage
```python
from transformers import AutoImageProcessor, ViTModel
from PIL import Image
import torch

# Load the image processor and the pretrained encoder.
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae1b-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae1b-full2b-224').cuda().eval()

# Preprocess an image and move the tensors to the GPU.
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt").to('cuda')

# Forward pass without gradient tracking; keep the encoder token features.
with torch.no_grad():
    outputs = model(**inputs)

encoder_hidden_states = outputs.last_hidden_state
```
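For downstream use such as linear probing or retrieval, a single image-level embedding is often derived from the token features. The following is a minimal sketch that mean-pools the patch tokens from `encoder_hidden_states` above, assuming the standard ViT token layout with a leading CLS token; using mean pooling rather than the CLS token is an illustrative choice here, not a recommendation from the model card.

```python
# Continues from the basic usage example above.
# Mean-pool the patch tokens (dropping the CLS token at index 0) to get one
# embedding per image. This pooling strategy is an assumption for illustration.
patch_tokens = encoder_hidden_states[:, 1:, :]
image_embedding = patch_tokens.mean(dim=1)  # shape: (batch_size, 1536)
print(image_embedding.shape)
```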
## Documentation
### Model Details

| Property | Details |
|----------|---------|
| Architecture | ViT (1536 width, 40 depth, 24 heads) |
| Parameters | 1B |
| Resolution | 224×224 pixels |
| Training | Self-supervised Web-MAE on 2B image samples from MetaCLIP web data |
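To confirm these hyperparameters programmatically, the configuration can be inspected without downloading the weights. This is a minimal sketch assuming the checkpoint exposes a standard `ViTConfig`; the attribute names follow the transformers ViT API, and the expected values are those from the table above.

```python
from transformers import AutoConfig

# Fetch only the model configuration (no weights are downloaded).
config = AutoConfig.from_pretrained('facebook/webssl-mae1b-full2b-224')

print(config.hidden_size)          # expected: 1536 (width)
print(config.num_hidden_layers)    # expected: 40 (depth)
print(config.num_attention_heads)  # expected: 24 (heads)
print(config.image_size)           # expected: 224
```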
### Model Description

Web-SSL MAE 1B is a 1-billion-parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images, without language supervision. It shows that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models such as CLIP across a range of vision tasks. Web-MAE is especially strong on OCR and chart understanding, and remains competitive on traditional vision benchmarks and multimodal tasks.

## License

This project is licensed under the cc-by-nc-4.0 license.
## Citation

```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```