Web-SSL DINO ViT-3B: Light-Filtered 2B MetaCLIP Data, 224 Resolution
This is a Vision Transformer (ViT) with 3 billion parameters, trained with DINOv2 self-supervised learning on lightly filtered web-scale image data without language supervision. It was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).
Quick Start
Web-SSL DINO ViT-3B is a high-capacity vision model. To get started, follow the steps in the "Usage Examples" section below.
⨠Features
- High-capacity model: With 3 billion parameters, it can capture rich visual information.
- Self-supervised learning: Trained without language supervision, relying on self-supervised learning on web-scale image data.
- Filtered training data: The "light2b" filtering improves OCR and chart understanding capabilities while maintaining strong performance across all vision tasks.
Installation
The original model card does not provide dedicated installation steps. The usage example below assumes the transformers, torch, and Pillow packages are installed.
Usage Examples
Basic Usage
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and the Web-SSL DINO ViT-3B backbone
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-light2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-light2b-224')

# Preprocess an input image (resized and normalized to 224x224 by the processor)
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

# Forward pass without gradients for feature extraction
with torch.no_grad():
    outputs = model(**inputs)

# CLS token embedding and per-patch embeddings
cls_features = outputs.last_hidden_state[:, 0]      # shape: (1, hidden_size)
patch_features = outputs.last_hidden_state[:, 1:]   # shape: (1, num_patches, hidden_size)
```
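For this model the hidden size is 3072 (see the Model Details table), so cls_features has shape (1, 3072). The number of patch tokens depends on the checkpoint's patch size; for example, a 14×14 patch size at 224×224 resolution would yield 256 patch tokens.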
Advanced Usage
The original model card provides no advanced usage code; one possible extension is sketched below.
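As a minimal sketch (not from the original model card), the snippet below batches several images through the same checkpoint and compares them with cosine similarity over mean-pooled patch features. The image paths are placeholders, and mean pooling is just one reasonable choice of image-level descriptor.

```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
import torch.nn.functional as F
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-light2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-light2b-224')
model.eval()

# Hypothetical image paths; replace with real files
paths = ['image_a.jpg', 'image_b.jpg', 'image_c.jpg']
images = [Image.open(p).convert('RGB') for p in paths]

# The processor batches and normalizes all images in one call
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the patch tokens (dropping the CLS token) into one vector per image
pooled = outputs.last_hidden_state[:, 1:].mean(dim=1)   # (batch, hidden_size)
pooled = F.normalize(pooled, dim=-1)

# Pairwise cosine similarities between the images
similarity = pooled @ pooled.T
print(similarity)
```

Using the CLS token (outputs.last_hidden_state[:, 0]) instead of mean-pooled patch tokens is an equally common choice for an image-level descriptor.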
Documentation
Model Details
| Property | Details |
|----------|---------|
| Architecture | ViT (3072 width, 26 depth, 24 heads) |
| Parameters | 3B |
| Resolution | 224×224 pixels |
| Training | Self-supervised Web-DINO on lightly filtered MetaCLIP data |
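These properties can be checked directly from the published configuration. The sketch below assumes the checkpoint exposes a standard Dinov2 configuration through transformers; attribute names may differ for other architectures.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('facebook/webssl-dino3b-light2b-224')

# Width, depth, and number of attention heads reported in the table above
print(config.hidden_size)          # expected: 3072
print(config.num_hidden_layers)    # expected: 26
print(config.num_attention_heads)  # expected: 24
print(config.image_size)           # expected: 224
```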
Model Descriptions
Web-SSL DINO 3B is a Vision Transformer model with 3 billion parameters. It is trained using self-supervised learning on lightly filtered web images without language supervision. The "light2b" designation indicates training on a subset of images containing any textual content, retaining approximately 50.3% of the original MetaCLIP dataset. This filtering improves OCR and chart understanding capabilities while maintaining strong performance across all vision tasks. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks.

License
The model is released under the cc-by-nc-4.0 license.
Technical Details
The model is introduced in the paper "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025), which shows that self-supervised learning on web-scale image data is effective for vision tasks without language supervision.
Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```