🌐 Web-SSL DINO ViT-1B: 2B MetaCLIP data, 224 Resolution
A 1 billion parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data without language supervision.
🚀 Quick Start
Web-SSL DINO 1B is a powerful Vision Transformer (ViT) trained with self-supervised learning on 2 billion web images, without any language supervision. The model shows that properly scaled, purely visual learning can match or outperform language-supervised models like CLIP on various vision tasks.
✨ Features
- High-capacity Model: With 1 billion parameters, it can capture complex visual patterns.
- Self-supervised Learning: Trained with Web-DINO on a large web image dataset, without language supervision.
- Versatile Performance: Can match or exceed the performance of language-supervised models on multiple vision tasks.
📦 Installation
No specific installation steps were provided in the original document. The usage example below relies on the transformers, torch, and Pillow packages, which can be installed as shown next.
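A typical setup, assuming a standard pip environment:

```bash
pip install transformers torch pillow
```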
💻 Usage Examples
Basic Usage
```python
from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and the 1B-parameter Web-SSL DINO model.
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino1b-full2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino1b-full2b-224')

# Preprocess a single image (the path is a placeholder).
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Token 0 is the CLS embedding; the remaining tokens are patch embeddings.
cls_features = outputs.last_hidden_state[:, 0]
patch_features = outputs.last_hidden_state[:, 1:]
```
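One common use of the CLS features is measuring visual similarity between images. The sketch below is a minimal, assumed workflow (the file names 'cat1.jpg' and 'cat2.jpg' are placeholders, and the `embed` helper is introduced here for illustration):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, Dinov2Model

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino1b-full2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino1b-full2b-224')
model.eval()

def embed(path: str) -> torch.Tensor:
    """Return the L2-normalized CLS embedding for one image."""
    image = Image.open(path).convert('RGB')
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    cls = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_dim)
    return F.normalize(cls, dim=-1)

# Cosine similarity of two normalized embeddings (placeholder paths).
similarity = embed('cat1.jpg') @ embed('cat2.jpg').T
print(f"cosine similarity: {similarity.item():.3f}")
```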
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | Vision Transformer (ViT) |
| Architecture | ViT (1536 width, 40 depth, 24 heads) |
| Parameters | 1B |
| Resolution | 224×224 pixels |
| Training Data | Self-supervised Web-DINO on 2B image samples from MetaCLIP web data |
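These figures can be sanity-checked against the loaded checkpoint. The sketch below assumes the standard Dinov2Config field names; the expected values in the comments come from the table above:

```python
from transformers import Dinov2Model

model = Dinov2Model.from_pretrained('facebook/webssl-dino1b-full2b-224')
config = model.config

# Architecture fields corresponding to the table above.
print(config.hidden_size)          # expected: 1536 (width)
print(config.num_hidden_layers)    # expected: 40 (depth)
print(config.num_attention_heads)  # expected: 24 (heads)

# Total parameter count, expected to be roughly 1B.
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e9:.2f}B")
```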
Model Description
Web-SSL DINO 1B is a 1 billion parameter Vision Transformer model trained using self-supervised learning on 2 billion web images without language supervision. It demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models like CLIP across various vision tasks.

📄 License
This model is released under the cc-by-nc-4.0 license.
📖 Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```