BEiT Base Fine-tuned ADE 640×640

Developed by Microsoft
BEiT is a model based on the Vision Transformer (ViT) architecture, pre-trained on ImageNet-21k through self-supervised learning and fine-tuned on the ADE20k dataset for semantic image segmentation.
Downloads 1,645
Release Time : 3/2/2022

Model Overview

The BEiT model uses a BERT-like Transformer encoder, pre-trained via masked image patch prediction. It supports high-resolution semantic segmentation and is suited to computer vision tasks such as scene parsing.
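A minimal inference sketch with the Hugging Face transformers library, assuming the public checkpoint name `microsoft/beit-base-finetuned-ade-640-640` (the example image URL is an arbitrary COCO sample used for illustration):

```python
import requests
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitForSemanticSegmentation

# Load the processor and the ADE20k fine-tuned checkpoint
# (weights are downloaded on first use)
name = "microsoft/beit-base-finetuned-ade-640-640"
processor = BeitImageProcessor.from_pretrained(name)
model = BeitForSemanticSegmentation.from_pretrained(name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits shape: (batch, 150 ADE20k classes, H/4, W/4);
# upsample to the input resolution and take the per-pixel argmax
logits = outputs.logits
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
segmentation = upsampled.argmax(dim=1)[0]  # (H, W) map of class indices
```

Each pixel in `segmentation` holds one of the 150 ADE20k class indices, which can be colorized or overlaid on the input for visualization.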

Model Features

Self-Supervised Pre-training
Pre-trained on the ImageNet-21k dataset via masked image patch prediction to learn intrinsic image representations.
High-Resolution Fine-tuning
Fine-tuned on the ADE20k dataset at 640×640 resolution to optimize semantic segmentation performance.
Relative Position Encoding
Uses T5-like relative position encoding instead of absolute position encoding to enhance model flexibility.
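The masked patch prediction objective above can be sketched in plain NumPy: split the 640×640 input into 16×16 patches and hide a random subset, which the pre-training task must then reconstruct. This is a simplified illustration (the masking ratio of 40% is an assumption for the example; the real model uses blockwise masking and predicts discrete visual tokens from a tokenizer, not raw pixels):

```python
import numpy as np

def mask_patches(image, patch_size=16, mask_ratio=0.4, seed=0):
    """Split an image into non-overlapping patches and zero out a random subset.

    Illustrates the masked-image-modeling setup: during pre-training the
    model must predict the content of the masked patches.
    """
    h, w, c = image.shape
    n_h, n_w = h // patch_size, w // patch_size
    # Reshape to (n_patches, patch_size, patch_size, channels)
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size, patch_size, c)

    # Randomly pick patches to mask and replace them with a placeholder value
    rng = np.random.default_rng(seed)
    n_masked = int(mask_ratio * len(patches))
    masked_idx = rng.choice(len(patches), size=n_masked, replace=False)
    patches[masked_idx] = 0.0
    return patches, masked_idx

image = np.random.rand(640, 640, 3)
patches, masked_idx = mask_patches(image)
print(patches.shape)    # (1600, 16, 16, 3): a 40x40 grid of 16x16 patches
print(len(masked_idx))  # 640 masked patches at a 40% ratio
```

A 640×640 input yields a 40×40 grid of 1600 patches, which is why the fine-tuning resolution matters: more patches give the segmentation head a finer spatial grid to predict over.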

Model Capabilities

Image Semantic Segmentation
Scene Parsing
Visual Feature Extraction

Use Cases

Computer Vision
Building Scene Parsing
Performs semantic segmentation on images containing buildings such as houses and castles to identify different object regions.
Achieves state-of-the-art results on the ADE20k benchmark dataset.
Urban Landscape Analysis
Analyzes urban street images to identify elements such as roads, vehicles, and pedestrians.
Performs well on datasets such as Cityscapes.