Web-SSL MAE ViT-H (700M): 2B MetaCLIP data, 224 resolution
This is a Vision Transformer (ViT-H) with 700 million parameters, trained with Masked Autoencoder (MAE) self-supervised learning on web-scale image data without language supervision. It was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
Quick Start
The following Python snippet loads the model and extracts image features:
```python
from transformers import AutoImageProcessor, ViTModel
import torch
from PIL import Image

# Load the image processor and the Web-SSL MAE ViT-H encoder.
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae700m-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae700m-full2b-224').cuda().eval()

# Preprocess an image and move the tensors to the GPU.
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt").to('cuda')

# Run the encoder without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

encoder_hidden_states = outputs.last_hidden_state
```
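When loaded through `ViTModel`, `last_hidden_state` has shape `(batch_size, num_tokens, hidden_size)`; the first token is the class token and the remaining tokens correspond to image patches.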
Features
- No Language Supervision: Trained on web-scale image data without language supervision, demonstrating that pure visual learning can achieve excellent performance.
- Strong Performance: Performs particularly well on OCR & Chart understanding tasks and maintains competitive performance across traditional vision benchmarks and multimodal tasks.
Installation
The model runs with the standard Hugging Face stack; installing `transformers`, `torch`, and `pillow` (e.g. `pip install torch transformers pillow`) is sufficient for the snippets in this card.
Usage Examples
Basic Usage
The basic usage is identical to the Quick Start snippet above: load the processor and model, preprocess an image, and take `outputs.last_hidden_state` as the patch-level features.
Advanced Usage
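The original card does not include an advanced example. Below is a minimal sketch, not taken from the official card, of one way to turn the encoder output into a single image-level embedding for retrieval or similarity search. It assumes mean pooling over patch tokens is a reasonable readout for this checkpoint; the helper name `image_embedding` and the image paths are placeholders.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

MODEL_ID = "facebook/webssl-mae700m-full2b-224"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = ViTModel.from_pretrained(MODEL_ID).to(device).eval()

# Hypothetical helper (not part of the official card): mean-pool the patch
# tokens (dropping the class token at index 0) into one embedding per image.
def image_embedding(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state   # (1, num_tokens, hidden_size)
    feat = tokens[:, 1:, :].mean(dim=1)              # average over patch tokens
    return F.normalize(feat, dim=-1)                 # unit norm for cosine similarity

# Compare two images by the cosine similarity of their pooled embeddings.
emb_a = image_embedding("path/to/image_a.jpg")
emb_b = image_embedding("path/to/image_b.jpg")
print("cosine similarity:", (emb_a @ emb_b.T).item())
```

Using the class token (`last_hidden_state[:, 0]`) instead of mean pooling is an equally common readout; which works better for this checkpoint depends on the downstream task.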
Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | ViT-H (Huge) |
| Parameters | 700M |
| Resolution | 224×224 pixels |
| Training Data | Self-supervised Web-MAE on 2B image samples from MetaCLIP web data |
Model Description
Web-SSL MAE ViT-H is a 700 million parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images, without language supervision. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks. Web-MAE performs particularly strongly on OCR & Chart understanding tasks while remaining competitive across traditional vision benchmarks and multimodal tasks.
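As a quick sanity check on the parameter count quoted above, you can load the checkpoint and sum the parameter sizes. This is an illustrative snippet rather than part of the original card, and the exact total may differ slightly depending on whether the optional pooling head is instantiated.

```python
from transformers import ViTModel

# Load the checkpoint and count parameters; the total should be on the
# order of 700M, matching the size quoted in the table above.
model = ViTModel.from_pretrained("facebook/webssl-mae700m-full2b-224")
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.0f}M")
```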

Technical Details
For in-depth training details, refer to "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
License
The model is licensed under CC BY-NC 4.0.
Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```