
WebSSL DINO2B Full2B 224

Developed by Facebook
A 2-billion-parameter vision Transformer trained on 2 billion web images through pure visual self-supervised learning, excelling at multimodal tasks
Downloads: 50
Release Date: 4/25/2025

Model Overview

This is a 2-billion-parameter vision Transformer trained with the DINOv2 self-supervised learning framework. It requires no language supervision, yet matches or surpasses language-supervised models across a range of vision tasks.

Model Features

Pure visual self-supervised learning
No language supervision required, trained solely on visual data
Large-scale training
Trained on 2 billion web image samples
High performance
Excellent performance on traditional vision benchmarks and multimodal tasks
Dual attention implementation
Supports both 'eager' and 'sdpa' attention implementations
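A minimal loading sketch, assuming the Hugging Face model ID facebook/webssl-dino2b-full2b-224 (inferred from the title above, not stated on this page). The attention implementation is selected at load time:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/webssl-dino2b-full2b-224"  # assumed from the model title

processor = AutoImageProcessor.from_pretrained(MODEL_ID)

# 'sdpa' routes attention through PyTorch's scaled_dot_product_attention;
# pass attn_implementation="eager" for the reference implementation instead.
model = AutoModel.from_pretrained(
    MODEL_ID,
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
)
```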

Model Capabilities

Image feature extraction
Visual representation learning
Multimodal task processing
Visual question answering
Text recognition (OCR)
Chart understanding
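A minimal feature-extraction sketch under the same assumed model ID, with the token layout assumed to follow DINOv2 (one CLS token followed by patch tokens):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/webssl-dino2b-full2b-224"  # assumed from the model title

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # any local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state  # (1, 1 + num_patches, hidden_dim), assumed layout
cls_embedding = hidden[:, 0]        # global image embedding for retrieval/classification
patch_tokens = hidden[:, 1:]        # per-patch features for dense tasks
```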

Use Cases

Computer vision
Image classification
Utilizing image features extracted by the model for classification tasks
Performance on par with or surpassing language-supervised models
Object detection
Object localization through the model's patch token features, as shown in the sketch after this list
Multimodal applications
Visual question answering
Combining with language models to answer questions about image content
Strong reported performance
Chart understanding
Parsing and understanding visual information in charts
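An illustrative sketch of the patch-token usage mentioned under object detection: with the assumed DINOv2-style layout, the patch tokens can be reshaped into a 2D feature map that a detection or segmentation head could consume. The shapes below are placeholder assumptions (256 patches for a 224px input with 14px patches), not confirmed model dimensions:

```python
import math
import torch

# Stand-in for patch_tokens from the feature-extraction sketch above:
# (batch, num_patches, hidden_dim); both sizes are placeholder assumptions.
patch_tokens = torch.randn(1, 256, 1536)

grid = math.isqrt(patch_tokens.shape[1])  # 16x16 patch grid
feature_map = patch_tokens.transpose(1, 2).reshape(1, -1, grid, grid)

# (1, hidden_dim, 16, 16): a spatial feature map for a localization head
print(feature_map.shape)
```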