
BEiT Large Patch16 224

Developed by Microsoft
BEiT is an image classification model based on the Vision Transformer (ViT) architecture. It is pretrained with self-supervised learning on ImageNet-21k and fine-tuned on ImageNet-1k.
Downloads 222.46k
Release Time: 3/2/2022

Model Overview

The BEiT model uses a BERT-like Transformer encoder. During self-supervised pretraining, a subset of image patches is masked and the model learns to predict the visual tokens of the masked patches. The pretrained encoder is then fine-tuned for image classification.
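As a rough illustration of the masked-patch setup described above, the sketch below computes the patch grid implied by the model name (224-pixel images, 16-pixel patches) and draws a random patch mask. The function names, the 0.4 mask ratio, and the uniform random sampling are assumptions made for this example; they are not taken from this page and may differ from the actual pretraining recipe.

```python
import random
from typing import List

IMAGE_SIZE = 224  # from the model name "Patch16 224"
PATCH_SIZE = 16   # from the model name "Patch16 224"


def num_patches(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_patch_mask(n_patches: int, mask_ratio: float, seed: int = 0) -> List[bool]:
    """Boolean mask with round(n_patches * mask_ratio) entries set to True.

    Uniform sampling is a simplification chosen for this sketch.
    """
    n_masked = round(n_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [i in masked for i in range(n_patches)]


n = num_patches()                            # 14 * 14 patches
mask = random_patch_mask(n, mask_ratio=0.4)  # 0.4 is an assumed ratio

# In pretraining, the encoder would see the image with masked patches hidden
# and be trained to predict the visual token of each masked position.
target_positions = [i for i, is_masked in enumerate(mask) if is_masked]
print(n, len(target_positions))  # 196 78
```

The positions in `target_positions` are the ones for which a visual token would be predicted; the unmasked patches only provide context.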

Model Features

Self-supervised pretraining
Uses a BERT-like masked prediction objective for self-supervised pretraining on ImageNet-21k
Relative position encoding
Uses T5-style relative position encoding instead of absolute position encoding
Efficient feature extraction
Performs classification by average-pooling the final hidden states of all image patches rather than relying on the [CLS] token
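The pooling choice in the last feature can be sketched on toy data. The example below averages the patch states and ignores the [CLS] state. The helper names and the layout of the encoder output (one [CLS] vector at index 0 followed by the patch vectors) are assumptions for illustration, not details confirmed by this page.

```python
from typing import List, Tuple

Vector = List[float]


def split_cls(hidden_states: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Split encoder output into the [CLS] state and the patch states.

    Assumes the [CLS] state sits at index 0.
    """
    return hidden_states[0], hidden_states[1:]


def mean_pool(patch_states: List[Vector]) -> Vector:
    """Average the patch states dimension by dimension."""
    if not patch_states:
        raise ValueError("need at least one patch state")
    n = len(patch_states)
    dim = len(patch_states[0])
    return [sum(state[d] for state in patch_states) / n for d in range(dim)]


# Toy encoder output: one [CLS] vector, then three patch vectors (hidden size 2).
hidden = [
    [9.0, 9.0],  # [CLS], unused by mean pooling
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
cls_state, patch_states = split_cls(hidden)
pooled = mean_pool(patch_states)
print(pooled)  # [3.0, 4.0]
```

In a real model the pooled vector would feed a linear classification layer; here it only shows that the [CLS] state does not influence the result.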

Model Capabilities

Image classification
Visual feature extraction

Use Cases

Computer vision
ImageNet image classification
Classifies input images into one of 1000 ImageNet categories
Demonstrates excellent performance on ImageNet benchmarks
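A minimal sketch of this use case with the Hugging Face `transformers` library follows. The Hub identifier `microsoft/beit-large-patch16-224`, the helper names, and the image path argument are assumptions for this example and do not appear on this page. The heavy imports are kept inside `classify` so that the ranking helpers can be run without downloading the model.

```python
import math
from typing import Dict, List, Tuple

MODEL_ID = "microsoft/beit-large-patch16-224"  # assumed Hub identifier


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits: List[float], id2label: Dict[int, str], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


def classify(image_path: str, k: int = 5) -> List[Tuple[str, float]]:
    """Classify one image file; requires torch, Pillow, and transformers."""
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageClassification.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()  # one logit per class
    return top_k(logits, model.config.id2label, k)
```

Calling `classify("cat.jpg")` with a local image file (the file name is hypothetical) would return up to five label and probability pairs drawn from the 1000 ImageNet categories.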