
WebSSL DINO 7B Full8B 518

Developed by Facebook (Meta)
A 7-billion-parameter vision Transformer trained on 8 billion web images from MetaCLIP data using the DINOv2 self-supervised learning framework, with no language supervision.
Downloads 157
Release Date: 4/25/2025

Model Overview

This vision Transformer was trained on web-scale image data through self-supervised learning. It demonstrates that purely visual learning can match, and in some cases surpass, language-supervised models across a range of vision tasks.

Model Features

Pure visual self-supervised learning: trained entirely on web image data, with no language supervision of any kind.
Large-scale training data: trained on 8 billion web image samples from MetaCLIP.
High-resolution processing: supports image input at 518×518 pixels.
Multi-task adaptability: strong performance on both traditional vision benchmarks and multimodal tasks.

Model Capabilities

Image feature extraction
Visual representation learning
Visual question answering
Optical character recognition (OCR)
Chart understanding

Use Cases

Computer vision
Image classification: feature extraction for classification tasks, with strong results on traditional vision benchmarks.
Object detection: serves as a backbone feature extractor for detection tasks.
Multimodal applications
Visual question answering: question-answering systems that require understanding of image content.
Document understanding: OCR and document layout analysis.
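The "feature extraction for image classification" use case above can be illustrated with a minimal downstream sketch: once the model has produced per-image embeddings, a nearest-neighbor classifier over cosine similarity is a common language-free baseline. The embeddings and labels below are random stand-ins, not real model outputs.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def nn_classify(query: np.ndarray, gallery: np.ndarray, labels: list[str]) -> str:
    """Label a query embedding with the label of its most similar gallery embedding."""
    sims = cosine_sim(query[None, :], gallery)[0]
    return labels[int(np.argmax(sims))]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(4, 8))            # 4 reference images, 8-dim stand-in embeddings
labels = ["cat", "dog", "car", "chart"]
query = gallery[2] + 0.01 * rng.normal(size=8)  # slightly perturbed copy of the "car" embedding
print(nn_classify(query, gallery, labels))
```

With real embeddings from the model, `gallery` would hold one feature vector per labeled reference image and `query` the feature vector of the image to classify.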