finetuned-ViT-model
This model is a fine-tuned Vision Transformer (ViT) for object detection, trained to detect hard hats, heads, and people in images. It was built for educational purposes and has potential safety applications.
Quick Start
This model is a fine-tuned version of facebook/detr-resnet-50-dc5 on the Hard Hat Dataset. It achieves a loss of 0.9937 on the evaluation set.
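A minimal inference sketch using the transformers object-detection pipeline; the Hub id your-username/finetuned-ViT-model and the image filename are placeholders, not the actual repository name.

```python
from transformers import pipeline

# Placeholder Hub id -- replace with the actual repository name of this model.
detector = pipeline("object-detection", model="your-username/finetuned-ViT-model")

# Returns a list of dicts with "score", "label", and "box" (xmin/ymin/xmax/ymax).
results = detector("construction_site.jpg")
for det in results:
    print(f"{det['label']}: {det['score']:.2f} at {det['box']}")
```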
Features
- This model is a demonstration project for the Hugging Face Certification assignment, created for educational purposes.
- It can be used to demonstrate object detection with ViT and has potential applications in safety scenarios for construction sites or industrial environments.
- The model leverages the transformer architecture to process image patches and predict bounding boxes and labels for objects of interest.
Documentation
Model description
This model is a fine-tuned Vision Transformer (ViT) for object detection. It uses the facebook/detr-resnet-50-dc5 checkpoint as a base and is further trained on the hf-vision/hardhat dataset. Specifically, it is trained to detect hard hats, heads, and people in images.
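For finer control over the score threshold and label mapping, the checkpoint can also be loaded directly. This is a sketch; the Hub id and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_id = "your-username/finetuned-ViT-model"  # placeholder Hub id
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)

image = Image.open("site.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into (score, label, box) detections above a threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```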
Intended uses & limitations
- Intended Uses: This model can be used to demonstrate object detection with ViT. It can potentially be used in safety applications to identify whether individuals on construction sites or in industrial environments are wearing hard hats.
- Limitations: The model was trained on a limited amount of data and may not generalize well to images with significantly different characteristics, viewpoints, or lighting conditions. It is not intended for production use without further evaluation and validation.
Training and evaluation data
- Dataset: The model was trained on the hf-vision/hardhat dataset from Hugging Face Datasets. This dataset contains images of construction sites and industrial settings with annotations for hard hats, heads, and people.
- Data splits: The dataset is divided into "train" and "test" splits.
- Data augmentation: Data augmentation was applied during training using albumentations to improve model generalization. Augmentations included random horizontal flipping and random brightness/contrast adjustments (see the sketch below).
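The exact augmentation pipeline is not recorded in this card; a minimal albumentations sketch of the transforms listed above, assuming COCO-format bounding boxes and illustrative probabilities, might look like this:

```python
import numpy as np
import albumentations as A

# Illustrative augmentation pipeline matching the transforms described above;
# the probabilities and the COCO bbox format are assumptions.
train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["category_ids"]),
)

# Dummy image and one COCO-format box ([x_min, y_min, width, height]).
image = np.zeros((480, 640, 3), dtype=np.uint8)
augmented = train_transform(image=image, bboxes=[[10, 20, 100, 150]], category_ids=[1])
print(augmented["bboxes"], augmented["category_ids"])
```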
Training procedure
- Base model: The model was initialized from the facebook/detr-resnet-50-dc5 checkpoint, a pre-trained DETR model with a ResNet-50 backbone.
- Fine-tuning: The model was fine-tuned using the Hugging Face Trainer with the following hyperparameters:
- Learning rate: 1e-6
- Weight decay: 1e-4
- Batch size: 1
- Epochs: 3
- Max steps: 2500
- Optimizer: AdamW
- Evaluation: The model was evaluated on the test set using standard object detection metrics, including COCO metrics (Average Precision, Average Recall); see the evaluation sketch after this list.
- Hardware: Training was performed on Google Colab using GPU acceleration.
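The card does not state which tooling computed the COCO metrics. A minimal sketch using torchmetrics' MeanAveragePrecision (an assumption, not necessarily the original evaluation code):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# One image's predictions and ground truth in xyxy pixel coordinates.
preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 170.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([0]),  # label ids here are assumptions, not the model's mapping
}]
targets = [{
    "boxes": torch.tensor([[12.0, 18.0, 108.0, 165.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision(box_format="xyxy")
metric.update(preds, targets)
print(metric.compute())  # includes map, map_50, map_75, mar_100, ...
```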
Training hyperparameters
The following hyperparameters were used during training (see the TrainingArguments sketch after this list):
- learning_rate: 1e-06
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- training_steps: 500
- mixed_precision_training: Native AMP
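These settings map onto transformers TrainingArguments roughly as follows; the output directory and the fp16 flag (for Native AMP) are assumptions, and weight_decay is taken from the Training procedure section above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetuned-ViT-model",   # assumed output directory
    learning_rate=1e-6,
    weight_decay=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    max_steps=500,
    fp16=True,  # corresponds to "Native AMP" mixed-precision training
)
```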
Framework versions
| Property | Details |
|----------|---------|
| Transformers | 4.50.1 |
| Pytorch | 2.5.1+cu121 |
| Datasets | 3.4.1 |
| Tokenizers | 0.21.0 |
License
This model is released under the MIT license.