Webssl Dino300m Full2b 224

Developed by facebook

A Vision Transformer with 224×224 input resolution, trained on 2 billion MetaCLIP images using the DINOv2 self-supervised learning method

Image Classification

Transformers

#Self-supervised visual representation #300M parameter ViT #Language-free supervision

Downloads 503

Release Time: 4/25/2025

Model Overview

This is a 300-million-parameter Vision Transformer model trained via self-supervised learning on 2 billion web images without language supervision, suitable for various visual tasks.

Model Features

Large-scale self-supervised learning

Trained on 2 billion web images without any language supervision

High-performance visual representation

Performance comparable to or surpassing language-supervised models across various visual tasks

Standard input resolution

Takes 224×224 pixel input images
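
To make the 224×224 input size concrete, the following sketch works out how many tokens the encoder sees per image. It assumes a patch size of 14, which is the usual DINOv2 setting but is not stated on this page; the actual value can be read from the loaded model's config (patch_size). The helper name token_layout is hypothetical.

```python
def token_layout(image_size: int = 224, patch_size: int = 14) -> dict:
    """Return the patch grid and sequence length for a square input image.

    patch_size=14 is an assumption (the usual DINOv2 value), not a figure
    taken from this page.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size      # patches along each side
    num_patches = grid * grid            # total patch tokens
    return {
        "grid": (grid, grid),
        "num_patches": num_patches,
        "sequence_length": num_patches + 1,  # plus one CLS token
    }


layout = token_layout()
print(layout)
# {'grid': (16, 16), 'num_patches': 256, 'sequence_length': 257}
```

Under that assumption, the model output holds one CLS token followed by 256 patch tokens per image.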

Model Capabilities

Image feature extraction

Visual representation learning

Image classification

Object detection

Use Cases

Computer vision

Image classification

Perform image classification tasks using features extracted by the model

Object detection

Perform object detection by using the model as a backbone and attaching a detection head

🚀 Web-SSL DINO ViT-300M: 2B MetaCLIP data, 224 Resolution

A 300 million parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data without language supervision.

🚀 Quick Start

Web-SSL DINO 300M is a 300 million parameter Vision Transformer model trained using self-supervised learning on 2 billion web images without language supervision. This model demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks.

✨ Features

Architecture: ViT (1024 width, 24 depth, 16 heads)
Parameters: 300M
Resolution: 224×224 pixels
Training: Self-supervised Web-DINO on 2B image samples from MetaCLIP web data
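
The link between width, depth, and the headline parameter count can be sanity-checked with a common rule of thumb: each transformer block holds roughly 12 × width² weights (about 4 × width² for attention and 8 × width² for an MLP with expansion ratio 4). The sketch below is an approximation under that assumption, ignoring embeddings, biases, and normalization layers; it is not the model's exact count, and the function name approx_vit_params is hypothetical.

```python
def approx_vit_params(width: int, depth: int) -> int:
    """Rough ViT encoder size: ~12 * width^2 weights per block.

    Assumes an MLP expansion ratio of 4 and ignores embeddings,
    biases, and layer norms.
    """
    return 12 * depth * width ** 2


# Compare two encoder shapes against the 300M headline figure.
for width, depth in [(1024, 24), (1536, 40)]:
    millions = approx_vit_params(width, depth) / 1e6
    print(f"width={width}, depth={depth}: ~{millions:.0f}M parameters")
# width=1024, depth=24: ~302M parameters
# width=1536, depth=40: ~1132M parameters
```

By this estimate, a 1024-wide, 24-deep encoder lands near 300M parameters, while a 1536-wide, 40-deep encoder is above 1B.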

[Figure: WebSSL model overview]

📦 Installation

The original README lists no installation steps. The usage example below requires the transformers, torch, and Pillow packages.

💻 Usage Examples

Basic Usage

from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino300m-full2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino300m-full2b-224')

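# Load an image and convert it to RGB (handles grayscale and RGBA files)
image = Image.open('path/to/image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size)
cls_features = outputs.last_hidden_state[:, 0]     # CLS token features: (batch, hidden_size)
patch_features = outputs.last_hidden_state[:, 1:]  # patch token features: (batch, num_patches, hidden_size)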
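
One simple way to turn the CLS features into a classifier is a nearest-centroid rule: average the normalized features of each class, then assign new images to the class whose centroid has the highest cosine similarity. The following is a minimal sketch of that idea, not the evaluation protocol from the paper. Synthetic NumPy arrays stand in for real features (in practice, something like cls_features.numpy() collected over a labeled dataset), the 64-dimensional size is a toy value, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fit_centroids(features, labels):
    """Average the normalized features of each class into one prototype."""
    classes = np.unique(labels)
    feats = l2_normalize(features)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    return classes, l2_normalize(centroids)


def predict(features, classes, centroids):
    """Assign each feature vector to the class with the most similar centroid."""
    sims = l2_normalize(features) @ centroids.T
    return classes[sims.argmax(axis=1)]


# Synthetic stand-in for CLS features: 3 classes, toy 64-dim vectors.
rng = np.random.default_rng(0)
dim = 64
class_means = rng.normal(size=(3, dim))
train_labels = np.repeat(np.arange(3), 20)
train_feats = class_means[train_labels] + 0.1 * rng.normal(size=(60, dim))
test_labels = np.repeat(np.arange(3), 5)
test_feats = class_means[test_labels] + 0.1 * rng.normal(size=(15, dim))

classes, centroids = fit_centroids(train_feats, train_labels)
predictions = predict(test_feats, classes, centroids)
accuracy = float((predictions == test_labels).mean())
print(f"accuracy on synthetic data: {accuracy:.2f}")
```

The backbone stays frozen in this setup; only the class centroids are computed from labeled examples, which makes it a cheap first check of feature quality before training a linear or fine-tuned head.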

📚 Documentation

The model was introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).

📄 License

This project is licensed under the CC BY-NC 4.0 license.

📚 Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning}, 
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
