
Webssl Dino3b Light2b 224

Developed by facebook
A 3-billion-parameter Vision Transformer model trained with DINOv2 self-supervised learning on lightly filtered web-scale image data, without language supervision.
Downloads 25
Release Time: 4/25/2025

Model Overview

This is a 3-billion-parameter Vision Transformer (ViT) trained via DINOv2 self-supervised learning on lightly filtered web images, focusing on pure visual representation learning without language supervision.
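Because the model is a pure visual encoder, the typical workflow is to extract token features and pool them into an image embedding. Below is a minimal sketch that assumes the checkpoint is published on the Hugging Face Hub as facebook/webssl-dino3b-light2b-224 and is loadable through the generic AutoImageProcessor and AutoModel classes; the file name example.jpg is a placeholder.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino3b-light2b-224"  # assumed Hub identifier
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level features of shape (1, num_tokens, hidden_dim);
# mean pooling yields a single embedding per image.
features = outputs.last_hidden_state
embedding = features.mean(dim=1)
print(embedding.shape)

Mean pooling is only one choice; if the checkpoint exposes a class token or pooled output, that can serve as the image embedding instead.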

Model Features

Self-supervised learning
Uses the DINOv2 self-supervised learning method to learn effective visual representations without language supervision.
Lightly filtered data training
Trained on a lightly filtered MetaCLIP dataset that retains approximately 50.3% of the original samples, enhancing OCR and chart-understanding capabilities.
Large-scale parameters
Features a 3-billion-parameter Vision Transformer architecture capable of capturing richer visual features.

Model Capabilities

Image feature extraction
Visual representation learning
Enhanced OCR capability
Chart understanding

Use Cases

Computer vision
Image classification
Can be used as a frozen feature extractor for image classification tasks; see the linear-probe sketch after this list.
Object detection
Can serve as a backbone feature extractor for object detection tasks.
Document analysis
OCR enhancement
Because the lightly filtered training data retains more text-rich images, the model performs well on OCR-related tasks.
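For the image-classification use case above, a common pattern is to freeze the backbone and train a lightweight linear probe on pooled features. The sketch below illustrates this with scikit-learn; the Hub identifier is the same assumption as in the earlier snippet, and the image paths and labels are hypothetical placeholders you would replace with your own data.

import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino3b-light2b-224"  # assumed Hub identifier
processor = AutoImageProcessor.from_pretrained(model_id)
backbone = AutoModel.from_pretrained(model_id).eval()

@torch.no_grad()
def embed(path: str) -> np.ndarray:
    # Mean-pool the token features into one vector per image.
    pixels = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    tokens = backbone(**pixels).last_hidden_state
    return tokens.mean(dim=1).squeeze(0).numpy()

# Placeholder data: (image_path, class_label) pairs you would supply.
train_set = [("cat_01.jpg", 0), ("dog_01.jpg", 1)]
test_set = [("cat_02.jpg", 0), ("dog_02.jpg", 1)]

X_train = np.stack([embed(p) for p, _ in train_set])
y_train = [label for _, label in train_set]
X_test = np.stack([embed(p) for p, _ in test_set])
y_test = [label for _, label in test_set]

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))

Keeping the backbone frozen makes the probe cheap to train and is a standard way to evaluate self-supervised representations of this kind.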