**WebSSL-Dino7B-Full8B-378 Open-Source Vision Model - Achieving Exceptional Visual Representations with 8 Billion Image Training**


Developed by facebook

A 7-billion-parameter Vision Transformer trained on 8 billion web images without language supervision, learning strong visual representations through self-supervised learning.

Image Classification

Transformers

#Unsupervised visual representation #High-resolution processing #Multimodal adaptation


Release Time : 4/25/2025

Model Overview

This model is trained with the DINOv2 self-supervised learning method. Using purely visual learning, it matches or surpasses the performance of language-supervised models, and it is suitable for a range of vision tasks and multimodal applications.

Model Features

Large-scale self-supervised training

Trained on 8 billion web images without language supervision, demonstrating that purely visual learning is feasible at this scale

High-resolution processing

Supports 378×378 pixel input resolution for capturing finer visual features

Multi-task adaptability

Excellent performance on both traditional vision benchmarks and multimodal tasks

Model Capabilities

Image feature extraction

Visual representation learning

Multimodal task processing

Use Cases

Computer vision

Image classification

Performing image classification tasks using visual features extracted by the model

Object detection

Achieving fine-grained object detection through patch token features
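The image classification use case above can be sketched as a nearest-centroid classifier over feature vectors. This is a minimal illustration, not the method used in the source: the helper names (`l2_normalize`, `fit_centroids`, `predict`) are hypothetical, and the 3-dimensional toy vectors stand in for the much larger CLS features the model would produce.

```python
import math
from collections import defaultdict


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fit_centroids(features, labels):
    """Average the normalized features of each class into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for feature, label in zip(features, labels):
        feature = l2_normalize(feature)
        if label not in sums:
            sums[label] = [0.0] * len(feature)
        sums[label] = [a + b for a, b in zip(sums[label], feature)]
        counts[label] += 1
    return {
        label: l2_normalize([s / counts[label] for s in total])
        for label, total in sums.items()
    }


def predict(centroids, feature):
    """Return the label whose centroid has the highest cosine similarity."""
    feature = l2_normalize(feature)
    return max(
        centroids,
        key=lambda label: sum(a * b for a, b in zip(centroids[label], feature)),
    )


# Toy stand-ins for CLS features (assumption: real features come from the model).
train_features = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.1]]
train_labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(train_features, train_labels)
prediction = predict(centroids, [0.8, 0.3, 0.0])
print(prediction)
```

In practice a linear probe or k-nearest-neighbour classifier on frozen features is a common alternative; the centroid version is shown only because it needs no training loop.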

Multimodal applications

Visual question answering

Building image question-answering systems by pairing the model with a language model; the source reports excellent performance on this task

Chart understanding

Parsing visual information in complex charts

🚀 Web-SSL DINO ViT-7B: 8B MetaCLIP data, 378 Resolution

A 7-billion-parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data without language supervision.

🚀 Quick Start

Web-SSL DINO 7B is a Vision Transformer trained with self-supervised learning on 8 billion web images without language supervision. It performs well on both traditional vision benchmarks and multimodal tasks.

✨ Features

High-performance Architecture: The model uses the ViT architecture with a width of 4096, a depth of 32, and 32 heads.
Large-scale Parameters: It has 7 billion parameters, enabling it to learn complex visual patterns.
High-resolution Input: It accepts image inputs at a resolution of 378×378 pixels.
Self-supervised Training: Trained with self-supervised Web-DINO on 8B image samples from MetaCLIP web data, showing that purely visual learning can achieve excellent performance.
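As a rough sanity check on the architecture figures above, the sketch below estimates the weight count of the Transformer blocks from the stated width and depth. It assumes a plain MLP with an expansion ratio of 4 (not stated in the source) and ignores biases, normalization layers, and embeddings, so it is an order-of-magnitude estimate only. The function name `estimate_vit_block_params` is hypothetical.

```python
def estimate_vit_block_params(width: int, depth: int, mlp_ratio: int = 4) -> int:
    """Rough weight count for the Transformer blocks only.

    Ignores biases, layer norms, and patch/position embeddings.
    mlp_ratio=4 is an assumption, not a value given in the source.
    """
    attention = 4 * width * width        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * width * width  # two linear layers of a plain MLP
    return depth * (attention + mlp)


width, depth, heads = 4096, 32, 32  # values from the feature list above
params = estimate_vit_block_params(width, depth)
head_dim = width // heads

print(f"head dimension: {head_dim}")
print(f"estimated block parameters: {params / 1e9:.2f}B")
for name, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{name} weights: ~{params * bytes_per_param / 1e9:.1f} GB")
```

Under these assumptions the estimate comes out near 6.4B, the same order as the stated 7B; the difference may come from the ignored terms or from an MLP design that differs from the one assumed here. The memory figures cover weights only, not activations.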

📦 Installation

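The source gives no dedicated installation steps. The usage example below imports the transformers, torch, and Pillow (PIL) packages, so these need to be available in your environment.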

💻 Usage Examples

Basic Usage

from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

# Load the image processor and the pretrained backbone from the Hugging Face Hub
processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino7b-full8b-378')
model = Dinov2Model.from_pretrained('facebook/webssl-dino7b-full8b-378')

# Process an image (convert to RGB so grayscale or RGBA files are handled too)
image = Image.open('path/to/image.jpg').convert('RGB')
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_features = outputs.last_hidden_state[:, 0]  # CLS token features, shape (batch, hidden_size)
patch_features = outputs.last_hidden_state[:, 1:]  # patch-wise token features, shape (batch, num_patches, hidden_size)
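For dense tasks it often helps to view the patch tokens as a square grid. The sketch below shows the arithmetic only, in plain Python. The patch size of 14 is an assumption (it is not stated in the source, though it divides 378 evenly), and both helper names are hypothetical. Since the processor may resize or crop the input differently, the second helper derives the grid side from the actual token count instead.

```python
import math


def patch_grid(image_size: int = 378, patch_size: int = 14) -> tuple:
    """Return (grid side, number of patches) for a square input.

    patch_size=14 is an assumption; check the model configuration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


def grid_side_from_tokens(num_patch_tokens: int) -> int:
    """Recover the grid side from the observed number of patch tokens."""
    side = math.isqrt(num_patch_tokens)
    if side * side != num_patch_tokens:
        raise ValueError("patch tokens do not form a square grid")
    return side


side, num_patches = patch_grid()
print(f"{side} x {side} grid, {num_patches} patch tokens")

# With the tensors from the example above, the grid view would be, for instance:
#   side = grid_side_from_tokens(patch_features.shape[1])
#   grid = patch_features.reshape(patch_features.shape[0], side, side, -1)
```

Reading the side length from `patch_features.shape[1]` keeps the reshape correct even if the preprocessing resolution differs from the assumed one.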

📚 Documentation

Model Details

Property	Details
Model Type	Vision Transformer (ViT)
Architecture	ViT (4096 width, 32 depth, 32 heads)
Parameters	7B
Resolution	378×378 pixels
Training Data	Self-supervised Web-DINO on 8B image samples from MetaCLIP web data

Model Descriptions

Web-SSL DINO 7B is a 7-billion-parameter Vision Transformer trained with self-supervised learning on 8 billion web images without language supervision. It demonstrates that purely visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models such as CLIP across a range of vision tasks. It performs well on traditional vision benchmarks and on multimodal tasks, including visual question answering, OCR, and chart understanding.


📄 License

This project is licensed under the CC BY-NC 4.0 license (cc-by-nc-4.0).

📄 Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning}, 
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
