
WebSSL DINO2B Light2B 224

Developed by Facebook
A 2-billion-parameter vision Transformer model trained using the DINOv2 self-supervised learning framework on lightly filtered web-scale image data (without language supervision).
Downloads: 27
Release date: April 25, 2025

Model Overview

This model is trained via self-supervised learning on lightly filtered web image data, focusing on pure visual representation learning. It is suitable for various vision tasks and excels particularly in OCR and chart understanding.

Model Features

Pure Visual Learning
Self-supervised training using only image data, without language supervision.
Lightly Filtered Data
Trained on a lightly filtered subset of MetaCLIP data (retaining ~50.3% of the original data), balancing data quality and diversity.
Large-Scale Parameters
A 2-billion-parameter vision Transformer architecture providing powerful representation capabilities.
OCR and Chart Understanding Advantage
Enhances OCR and chart understanding while maintaining performance across all vision tasks.

Model Capabilities

Image feature extraction
Visual representation learning
OCR tasks
Chart understanding
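
The feature-extraction capability above can be sketched with the standard Hugging Face `transformers` loading pattern. The Hub id `facebook/webssl-dino2b-light2b-224` and the patch size of 14 are assumptions taken from the model name and the DINOv2 family convention, not a verified recipe; check the published model config before relying on them.

```python
MODEL_ID = "facebook/webssl-dino2b-light2b-224"  # assumed Hub id


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens a ViT produces for a square image (excluding CLS/register tokens)."""
    return (image_size // patch_size) ** 2


def extract_features(image):
    # torch/transformers are imported lazily so the helper above stays lightweight.
    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (batch, tokens, hidden). Token 0 is the CLS token,
    # a global image embedding; the remaining patch tokens suit dense tasks
    # such as detection or segmentation.
    return outputs.last_hidden_state
```

At 224×224 input with a 14-pixel patch, this would yield 16×16 = 256 patch tokens per image.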

Use Cases

Computer Vision
Image Classification
Utilizes image features extracted by the model for classification tasks.
Object Detection
Performs object localization and recognition using the model's patch token features.
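
The classification use case above is typically realized as a linear probe on frozen features. Below is a minimal sketch of that pipeline; the random vectors are placeholders standing in for the CLS embeddings the model would produce (substitute its real hidden size and extracted features in practice).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder "features": random vectors standing in for frozen CLS
# embeddings; DIM is arbitrary for the demo.
DIM = 64
X_train = rng.normal(size=(40, DIM))
y_train = np.repeat([0, 1], 20)
X_train[y_train == 1] += 2.0  # shift one class so the toy problem is separable

# The probe: a simple logistic-regression classifier on frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_test = rng.normal(size=(4, DIM))
X_test[2:] += 2.0
preds = probe.predict(X_test)
```

Because the backbone stays frozen, only the small linear head is trained, which is the usual way self-supervised representations like these are evaluated.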
Document Analysis
OCR Recognition
Identifies text content in images.
Delivers a significant improvement over comparable vision models.
Chart Understanding
Interprets charts and data visualizations in images.
Outperforms language-supervised models