webssl-dino2b-heavy2b-224: an open-source vision model optimized for chart and text understanding

Webssl Dino2b Heavy2b 224

Developed by facebook

A 2-billion parameter self-supervised vision Transformer model trained on rigorously filtered web-scale image data, specially optimized for chart and text understanding

Image Classification

Transformers

Release Time : 4/25/2025

Model Overview

This is a Vision Transformer model trained via self-supervised learning on carefully filtered web-scale image data. The training subset emphasizes charts, tables, and documents with readable text, which improves performance on OCR and chart understanding tasks.

Model Features

Rigorously filtered training data

Trained on a high-quality subset comprising only 1.3% of the original MetaCLIP dataset, specifically including charts, tables, and readable text documents

Self-supervised learning

Utilizes DINOv2 self-supervised learning approach, capable of learning powerful visual representations without language supervision

Large-scale parameters

2-billion parameter vision Transformer architecture providing powerful feature extraction capabilities

Optimized OCR capabilities

Specially optimized for text and chart understanding, showing outstanding performance in related tasks

Model Capabilities

Image feature extraction

Visual representation learning

Chart understanding

Text detection

Table recognition

Use Cases

Document processing

Table recognition

Extracting table structure and content from images

High-precision table detection and recognition

OCR enhancement

Improving text recognition accuracy in images

Improved text recognition performance in complex backgrounds

Visual understanding

Chart analysis

Understanding various chart types and data in images

Accurate chart classification and data extraction

🚀 Web-SSL DINO ViT-2B: Heavy Filtered 2B MetaCLIP data, 224 Resolution

This is a 2-billion-parameter Vision Transformer (ViT) model. It is trained with DINOv2 self-supervised learning on heavily filtered web-scale image data, without any language supervision. This model is introduced in the paper "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).

✨ Features

Trained using self-supervised learning on heavily filtered web images without language supervision.
The "heavy2b" subset focuses on images with charts, tables, and documents with readable text, improving OCR and chart understanding capabilities.
Demonstrates that pure visual learning can match or exceed the performance of language-supervised models like CLIP on various vision tasks.

📦 Installation

The original README does not give installation steps. The usage example below requires the transformers library together with PyTorch and Pillow, which can be installed with:

pip install transformers torch pillow

💻 Usage Examples

Basic Usage

from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino2b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino2b-heavy2b-224')

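# Load an image and convert it to RGB (handles grayscale or RGBA inputs)
image = Image.open('path/to/image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size)
cls_features = outputs.last_hidden_state[:, 0]  # CLS token features
patch_features = outputs.last_hidden_state[:, 1:] # patch-wise token features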
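
One common way to use the CLS features is to compare two images by cosine similarity. The following is a minimal sketch in plain Python, assuming each embedding has first been converted to a list of floats (for example with `cls_features[0].tolist()`). The helper name `cosine_similarity` and the toy vectors are illustrative only and are not part of the model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real CLS embeddings.
query = [1.0, 0.0, 2.0, 0.0]
same_direction = [2.0, 0.0, 4.0, 0.0]
orthogonal = [0.0, 3.0, 0.0, 1.0]

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: no shared components
```

For large collections, the same computation is usually done in batch on normalized tensors rather than one pair at a time.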

📚 Documentation

Model Details

Property	Details
Architecture	ViT (2688 width, 24 depth, 21 heads)
Parameters	2B
Resolution	224×224 pixels
Training	Self-supervised Web-DINO on heavily filtered MetaCLIP data
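
The figures in the table can be cross-checked with some quick arithmetic. The sketch below is a rough estimate and not an official parameter count. It assumes a patch size of 14 pixels (the DINOv2 convention) and a standard Transformer block with a 4x MLP expansion. Neither value is stated in the table, and biases, layer norms, and embeddings are ignored.

```python
# Values taken from the Model Details table.
width = 2688
depth = 24
heads = 21
image_size = 224

# Assumptions, not stated in the table: 14-pixel patches (as in DINOv2)
# and a standard Transformer block with a 4x MLP expansion.
patch_size = 14
mlp_ratio = 4

head_dim = width // heads                    # per-head dimension
patches_per_side = image_size // patch_size  # patches along one image edge
num_patches = patches_per_side ** 2          # patch tokens per image
num_tokens = num_patches + 1                 # patch tokens plus the CLS token

# Weight matrices per block: Q, K, V, and output projections,
# plus the two MLP projections.
attn_params = 4 * width * width
mlp_params = 2 * mlp_ratio * width * width
approx_params = depth * (attn_params + mlp_params)

print(head_dim)                        # 128
print(num_tokens)                      # 257
print(f"{approx_params / 1e9:.2f}B")   # 2.08B
```

Under these assumptions the estimate lands near the stated 2B parameters, and `last_hidden_state` would hold 257 tokens per 224×224 image.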

Model Descriptions

Web-SSL DINO 2B is a 2-billion-parameter Vision Transformer model. It is trained using self-supervised learning on heavily filtered web images without language supervision. The "heavy2b" designation means it is trained on a subset of images that explicitly contain charts, tables, and documents with readable text, which accounts for only 1.3% of the original MetaCLIP dataset. This focused filtering significantly enhances OCR and chart understanding capabilities while maintaining strong performance on other vision tasks.

📄 License

This model is licensed under cc-by-nc-4.0.

📄 Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning}, 
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
