
WebSSL DINO 3B Full2B 224

Developed by Facebook
This is a 3-billion-parameter Vision Transformer trained on 2 billion web images via DINOv2 self-supervised learning, capable of learning powerful visual representations without language supervision.
Release date: April 25, 2025

Model Overview

This model demonstrates that pure visual learning can match or exceed the performance of language-supervised models across various vision tasks, suitable for traditional vision benchmarks and multimodal tasks.

Model Features

Large-scale self-supervised learning
Trained on 2 billion web images, learning powerful visual representations without language supervision
High-performance vision model
Matches or exceeds the performance of language-supervised models in various vision tasks
Multi-task applicability
Suitable for traditional vision benchmarks as well as multimodal tasks like visual question answering, OCR, and chart understanding

Model Capabilities

Image feature extraction
Visual representation learning
Multimodal task processing
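As a sketch of the image feature extraction capability: the checkpoint (assumed to be published on Hugging Face under an id like facebook/webssl-dino3b-full2b-224) would normally be loaded with `ViTModel.from_pretrained(...)`; to keep the example runnable without downloading the 3B-parameter weights, a tiny randomly initialized ViT with the same 224x224 input resolution stands in. The configuration sizes here are illustrative, not the real model's.

```python
import torch
from transformers import ViTConfig, ViTModel

# Tiny stand-in ViT; the real model would be loaded via
# ViTModel.from_pretrained("facebook/webssl-dino3b-full2b-224")
# (model id assumed, not confirmed by this page).
config = ViTConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    image_size=224,   # matches the model's 224x224 input resolution
    patch_size=16,
)
model = ViTModel(config)
model.eval()

# One dummy 224x224 RGB image as a pixel-value tensor
# (in practice, produced by an image processor from a real image).
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# The [CLS] token embedding serves as a global image feature vector.
features = outputs.last_hidden_state[:, 0]
print(features.shape)  # torch.Size([1, 64])
```

The resulting feature vector can be fed to a linear classifier for image classification or passed to a multimodal head for tasks such as visual question answering.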

Use Cases

Computer vision
Image classification
Used for image classification tasks
Excellent performance on traditional vision benchmarks
Visual question answering
Handles question-answering tasks requiring visual understanding
Document analysis
OCR
Optical character recognition applications
Chart understanding
Parsing and understanding chart content