BEiT-FaceMask-Finetuned Open Source Model - Free Deployment for Precise Mask Status Detection

Beit FaceMask Finetuned

Developed by AkshatSurolia

A Vision Transformer model based on the BEiT architecture, specifically designed for mask detection tasks, fine-tuned on the Face-Mask18K dataset.

Image Classification

Transformers

Open Source License: Apache-2.0

Tags: High-precision mask recognition, Vision Transformer, Medical scenario adaptation

Downloads 23

Release Time: 3/2/2022

Model Overview

This model adopts the BEiT architecture, pre-trained on ImageNet-21k through self-supervised learning, and fine-tuned on the Face-Mask18K dataset containing 18,000 images to detect whether a mask is worn in an image.

Model Features

Self-supervised Pre-training

Utilizes BEiT's self-supervised pre-training method to learn general image representations, improving downstream task performance.

Relative Position Encoding

Employs relative position encoding similar to the T5 model, replacing the absolute position encoding in traditional ViT, enhancing model flexibility.

Efficient Fine-tuning

Fine-tuned on the Face-Mask18K dataset (18,000 labeled images), reaching 97.5% evaluation accuracy.

Model Capabilities

Image Classification

Mask Detection

Visual Feature Extraction

Use Cases

Public Health

Public Space Mask-Wearing Detection

Used to monitor whether people in public spaces are wearing masks, assisting in epidemic prevention management.

Evaluation accuracy reaches 97.5%

Smart Security

Access Control System Identity Verification

Combined with facial recognition, detects whether a person is wearing a mask during identity verification.

🚀 BEiT for Face Mask Detection

A BEiT model fine-tuned on the self-curated Face-Mask18K dataset (18k images, 2 classes) at a resolution of 224x224 for face mask detection.

🚀 Quick Start

This model starts from a pre-trained BEiT checkpoint and is fine-tuned on the self-curated Face-Mask18K dataset (18k images, 2 classes) at a resolution of 224x224. The BEiT architecture was introduced in the paper "BEIT: BERT Pre-Training of Image Transformers" by Hangbo Bao, Li Dong, and Furu Wei.
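
The page does not include loading code, so the following is only a minimal sketch of how such a checkpoint is commonly run with the Hugging Face `transformers` image-classification pipeline. The Hub id in `MODEL_ID`, the helper names, and the image path are assumptions for illustration; the source also does not state the label names, so the code returns whatever labels the checkpoint defines.

```python
# Assumed Hub id, built from the developer and model names on this page.
# Verify it on the model page before use.
MODEL_ID = "AkshatSurolia/BeiT-FaceMask-Finetuned"


def load_classifier(model_id=MODEL_ID):
    """Build an image-classification pipeline (downloads weights on first use)."""
    # Imported lazily so the pure helper below works without transformers installed.
    from transformers import pipeline

    return pipeline("image-classification", model=model_id)


def top_prediction(predictions):
    """Pick the highest-scoring entry from pipeline output.

    The pipeline returns a list of {"label": str, "score": float} dicts.
    """
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


def classify_image(image_path, classifier=None):
    """Classify one image file; image_path is a hypothetical local path."""
    classifier = classifier or load_classifier()
    return top_prediction(classifier(image_path))
```

A call such as `classify_image("face.jpg")` would return a `(label, score)` pair, where the label strings come from the checkpoint's own configuration.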

✨ Features

Image Classification: Specifically designed for face mask detection.
Pre-training and Fine-tuning: Pre-trained on ImageNet-21k, fine-tuned on ImageNet, then fine-tuned on Face-Mask18K for mask detection.
Advanced Architecture: A Vision Transformer (ViT) with relative position embeddings.
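
To make the idea of relative position embeddings concrete, here is a simplified sketch in NumPy. It assumes a 14x14 patch grid (224 / 16) and a single attention head, uses random numbers in place of learned parameters, and ignores the [CLS] token, so it illustrates the general technique and is not BEiT's exact implementation. The point is that the bias added to an attention score depends only on the offset between two patches, not on their absolute positions.

```python
import numpy as np


def relative_position_index(grid_h, grid_w):
    """Map every (query patch, key patch) pair to an index into a bias table."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=0)   # (2, N)
    rel = coords[:, :, None] - coords[:, None, :]          # (2, N, N) offsets
    dy = rel[0] + (grid_h - 1)                             # shift to 0..2*grid_h-2
    dx = rel[1] + (grid_w - 1)                             # shift to 0..2*grid_w-2
    return dy * (2 * grid_w - 1) + dx                      # (N, N)


grid = 14                                   # 224 / 16 patches per side
index = relative_position_index(grid, grid)
num_offsets = (2 * grid - 1) ** 2           # distinct (dy, dx) offsets

rng = np.random.default_rng(0)
bias_table = rng.normal(size=num_offsets)   # stand-in for learned values
attention_bias = bias_table[index]          # (196, 196), added to attention logits
```

Patch pairs with the same offset, for example (0, 1) and (14, 15), read the same table entry, which is what distinguishes this scheme from a per-position absolute embedding.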

📚 Documentation

Model description

The BEiT model is a Vision Transformer (ViT), which is a transformer encoder model (BERT-like). In contrast to the original ViT model, BEiT is pre-trained in a self-supervised fashion on a large collection of images, namely ImageNet-21k, at a resolution of 224x224 pixels. The pre-training objective is to predict, for masked patches, the visual tokens produced by the encoder of the VQ-VAE from OpenAI's DALL-E. Next, the model was fine-tuned in a supervised fashion on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at a resolution of 224x224.
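
The masked-patch objective described above can be sketched as a cross-entropy loss computed only at masked positions. This is a toy illustration under stated assumptions: the vocabulary size, the number of masked patches, and the random logits are all made-up values, and a real setup would take the target tokens from the image tokenizer and the logits from the encoder.

```python
import numpy as np


def masked_token_loss(logits, target_tokens, mask):
    """Mean cross-entropy over masked patches only.

    logits: (num_patches, vocab_size) scores for each visual token.
    target_tokens: (num_patches,) token ids from the image tokenizer.
    mask: (num_patches,) boolean, True where the patch was masked.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_tokens)), target_tokens]
    return float(-(picked * mask).sum() / mask.sum())


rng = np.random.default_rng(0)
num_patches, vocab_size, num_masked = 196, 512, 75   # toy values (assumptions)

target_tokens = rng.integers(0, vocab_size, size=num_patches)
mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=num_masked, replace=False)] = True

# Untrained model: random scores for every token.
random_logits = rng.normal(size=(num_patches, vocab_size))
loss_random = masked_token_loss(random_logits, target_tokens, mask)

# Idealized model: a large score on the correct token for every patch.
perfect_logits = np.full((num_patches, vocab_size), -20.0)
perfect_logits[np.arange(num_patches), target_tokens] = 20.0
loss_perfect = masked_token_loss(perfect_logits, target_tokens, mask)
```

Because the loss is multiplied by the mask, predictions at unmasked positions contribute nothing, which forces the encoder to infer the hidden content from the visible context.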

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Contrary to the original ViT models, BEiT models use relative position embeddings (similar to T5) instead of absolute position embeddings, and classify images by mean-pooling the final hidden states of the patches, instead of placing a linear layer on top of the final hidden state of the [CLS] token.
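
As a rough sketch of the patch step, the NumPy code below cuts a 224x224 RGB image into 16x16 patches and applies a linear projection. The `patchify` helper, the toy `hidden_size`, and the random projection matrix are assumptions for illustration; in the real model the projection is learned.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (gh, gw, ps, ps, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a preprocessed image

patches = patchify(image)                  # (196, 768): 14*14 patches, 16*16*3 values each
hidden_size = 64                           # toy width, not the model's real size
projection = rng.normal(scale=0.02, size=(patches.shape[1], hidden_size))
embeddings = patches @ projection          # (196, 64) sequence fed to the encoder
```

At this resolution the encoder therefore sees a sequence of 196 patch embeddings per image.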

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. Alternatively, one can mean-pool the final hidden states of the patch embeddings, and place a linear layer on top of that.
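
The two classification heads mentioned above differ only in which vector is passed to the linear layer. The sketch below shows both on random stand-in hidden states; the sizes, weights, and function names are illustrative assumptions, with two output classes to match the Face-Mask18K setup.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, hidden_size, num_classes = 196, 64, 2   # toy sizes (assumptions)

# Stand-in for the encoder output: row 0 is [CLS], rows 1.. are patches.
hidden_states = rng.normal(size=(1 + num_patches, hidden_size))

# Stand-in for a trained linear classification layer.
weights = rng.normal(scale=0.02, size=(hidden_size, num_classes))
bias = np.zeros(num_classes)


def cls_head(states):
    """Classify from the final hidden state of the [CLS] token."""
    return states[0] @ weights + bias


def mean_pool_head(states):
    """Classify from the average of the patch hidden states."""
    return states[1:].mean(axis=0) @ weights + bias


def softmax(logits):
    """Turn logits into class probabilities."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


cls_probs = softmax(cls_head(hidden_states))
pooled_probs = softmax(mean_pool_head(hidden_states))
```

Both heads produce one probability per class; the mean-pooled variant draws on every patch rather than a single summary token.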

Training Metrics

Property	Details
Epoch	0.55
Total FLOs	576468516GF
Train Loss	0.151
Train Runtime	0:58:16.56
Train Samples per Second	16.505
Train Steps per Second	1.032

Evaluation Metrics

Property	Details
Epoch	0.55
Eval Accuracy	0.975
Eval Loss	0.0803
Eval Runtime	0:03:13.02
Eval Samples per Second	18.629
Eval Steps per Second	2.331
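
The evaluation table does not state the size of the evaluation set or the batch size, but rough estimates follow from the reported runtime and throughput. The short calculation below is a back-of-envelope sketch: it assumes the logged rates are averages over the whole run, and the inputs are rounded, so the results are approximate.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS.ss' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the Evaluation Metrics table.
eval_runtime_s = to_seconds("0:03:13.02")
eval_samples_per_s = 18.629
eval_steps_per_s = 2.331
eval_accuracy = 0.975

# Derived estimates (not reported in the source).
eval_samples = eval_runtime_s * eval_samples_per_s       # about 3,596 images
eval_steps = eval_runtime_s * eval_steps_per_s           # about 450 steps
eval_batch_size = eval_samples_per_s / eval_steps_per_s  # about 8 per step
estimated_correct = eval_accuracy * eval_samples         # about 3,506 images

print(round(eval_samples), round(eval_steps), round(eval_batch_size))
```

An evaluation set of roughly 3,600 images would be about a fifth of the 18k-image dataset, though the source does not describe the actual split.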

📄 License

This project is licensed under the Apache-2.0 license.
