Web-SSL DINO ViT-3B: Heavy Filtered 2B MetaCLIP data, 224 Resolution
A 3-billion-parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data, offering strong performance across a wide range of vision tasks without language supervision.
🚀 Quick Start
To use the Web-SSL DINO 3B model, follow the example below:
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and model from the Hugging Face Hub
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224')

# Preprocess an input image
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# CLS token (global image representation) and per-patch token embeddings
cls_features = outputs.last_hidden_state[:, 0]
patch_features = outputs.last_hidden_state[:, 1:]
```
✨ Features
- Self-supervised Learning: Trained with self-supervised learning on heavily filtered web images, without language supervision.
- Focused Filtering: The "heavy2b" filtering selects images containing charts, tables, and documents, enhancing OCR and chart understanding capabilities.
- Strong Performance: Matches or exceeds language-supervised models such as CLIP across a range of vision tasks.
📦 Installation
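The model is distributed through the Hugging Face Hub, so the snippets in this card only require the `transformers` library together with `torch` and `Pillow`, e.g. `pip install transformers torch pillow`.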
💻 Usage Examples
Basic Usage
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224')

image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Global (CLS) and per-patch token embeddings
cls_features = outputs.last_hidden_state[:, 0]     # (batch, hidden_size)
patch_features = outputs.last_hidden_state[:, 1:]  # (batch, num_patches, hidden_size)
```
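For dense tasks such as segmentation, the patch tokens can be rearranged into a 2D feature map. The following is a minimal sketch that reuses `patch_features` from the example above; it infers the grid size from the number of patch tokens rather than assuming a particular patch size, and assumes a square input image.

```python
import math

# Recover the spatial grid of patch embeddings from the flat token sequence.
num_patches = patch_features.shape[1]
grid_size = int(math.isqrt(num_patches))
assert grid_size * grid_size == num_patches, "expected a square patch grid"

# (batch, num_patches, hidden) -> (batch, hidden, grid, grid), a CNN-style feature map
feature_map = patch_features.reshape(
    patch_features.shape[0], grid_size, grid_size, -1
).permute(0, 3, 1, 2)
print(feature_map.shape)
```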
Advanced Usage
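The original card does not include an advanced example; the following is a minimal sketch of one common use of a language-free encoder: comparing two images by the cosine similarity of their L2-normalized CLS embeddings. The file paths are placeholders, and the `embed` helper is illustrative rather than part of the model's API.

```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224').to(device).eval()

def embed(path):
    """Return an L2-normalized CLS embedding for one image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(cls, dim=-1)

# Placeholder paths; replace with real images
sim = (embed('path/to/image_a.jpg') * embed('path/to/image_b.jpg')).sum(dim=-1)
print(f"cosine similarity: {sim.item():.3f}")
```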
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Architecture | ViT (3072 width, 26 depth, 24 heads) |
| Parameters | 3B |
| Resolution | 224×224 pixels |
| Training | Self-supervised Web-DINO on heavily filtered MetaCLIP data |
Model Description
Web-SSL DINO 3B is a 3-billion-parameter Vision Transformer trained with self-supervised learning on heavily filtered web images, without language supervision. The "heavy2b" designation indicates training on a subset of images containing charts, tables, and documents with readable text, representing only 1.3% of the original MetaCLIP dataset. This focused filtering significantly improves OCR and chart understanding while maintaining strong performance on other vision tasks. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed language-supervised models such as CLIP across a range of vision tasks.

🔧 Technical Details
The model was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025). It is a Vision Transformer (ViT) with 3 billion parameters, trained with DINOv2 self-supervised learning on heavily filtered web-scale image data, without language supervision.
📄 License
The model is licensed under cc-by-nc-4.0.
📖 Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```