finetuned-ViT-model
This model is a fine-tuned Vision Transformer (ViT) for object detection, trained to detect hard hats, heads, and people in images. It was built for educational purposes and has potential safety applications.
Quick Start
This model is a fine-tuned version of facebook/detr-resnet-50-dc5 on the Hard Hat Dataset. It achieves a loss of 0.9937 on the evaluation set.
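A minimal inference sketch using the transformers object-detection pipeline; the Hub id your-username/finetuned-ViT-model and the image filename are placeholders, not the actual repository name.

```python
from transformers import pipeline

# Placeholder Hub id -- replace with the actual repository name of this model.
detector = pipeline("object-detection", model="your-username/finetuned-ViT-model")

# Returns a list of dicts with "score", "label", and "box" (xmin/ymin/xmax/ymax).
results = detector("construction_site.jpg")
for det in results:
    print(f"{det['label']}: {det['score']:.2f} at {det['box']}")
```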
Features
- This model is a demonstration project for the Hugging Face Certification assignment, created for educational purposes.
- It can be used to demonstrate object detection with ViT and has potential applications in safety scenarios for construction sites or industrial environments.
- The model leverages the transformer architecture to process image patches and predict bounding boxes and labels for objects of interest.
Documentation
Model description
This model is a fine-tuned Vision Transformer (ViT) for object detection. It uses the facebook/detr-resnet-50-dc5 checkpoint as a base and is further trained on the hf-vision/hardhat dataset. Specifically, it is trained to detect hard hats, heads, and people in images.
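For finer control over the score threshold and label mapping, the checkpoint can also be loaded directly. This is a sketch; the Hub id and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_id = "your-username/finetuned-ViT-model"  # placeholder Hub id
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)

image = Image.open("site.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into (score, label, box) detections above a threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```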
Intended uses & limitations
- Intended Uses: This model can be used to demonstrate object detection with ViT. It can potentially be used in safety applications to identify whether individuals on construction sites or in industrial environments are wearing hard hats.
- Limitations: The model was trained on a limited amount of data and may not generalize well to images with significantly different characteristics, viewpoints, or lighting conditions. It is not intended for production use without further evaluation and validation.
Training and evaluation data
- Dataset: The model was trained on the hf-vision/hardhat dataset from Hugging Face Datasets. This dataset contains images of construction sites and industrial settings with annotations for hard hats, heads, and people.
- Data splits: The dataset is divided into "train" and "test" splits.
- Data augmentation: Data augmentation was applied during training using albumentations to improve model generalization. Augmentations included random horizontal flipping and random brightness/contrast adjustments (see the sketch below).
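The exact augmentation pipeline is not recorded in this card; a minimal albumentations sketch of the transforms listed above, assuming COCO-format bounding boxes and illustrative probabilities, might look like this:

```python
import numpy as np
import albumentations as A

# Illustrative augmentation pipeline matching the transforms described above;
# the probabilities and the COCO bbox format are assumptions.
train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["category_ids"]),
)

# Dummy image and one COCO-format box ([x_min, y_min, width, height]).
image = np.zeros((480, 640, 3), dtype=np.uint8)
augmented = train_transform(image=image, bboxes=[[10, 20, 100, 150]], category_ids=[1])
print(augmented["bboxes"], augmented["category_ids"])
```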
Training procedure
- Base model: The model was initialized from the facebook/detr-resnet-50-dc5 checkpoint, a pre-trained DETR model with a ResNet-50 backbone.
- Fine-tuning: The model was fine-tuned using the Hugging Face Trainer with the following hyperparameters:
- Learning rate: 1e-6
- Weight decay: 1e-4
- Batch size: 1
- Epochs: 3
- Max steps: 2500
- Optimizer: AdamW
- Evaluation: The model was evaluated on the test set using standard object detection metrics, including COCO metrics (Average Precision, Average Recall); see the evaluation sketch after this list.
- Hardware: Training was performed on Google Colab using GPU acceleration.
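The card does not state which tooling computed the COCO metrics. A minimal sketch using torchmetrics' MeanAveragePrecision (an assumption, not necessarily the original evaluation code):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# One image's predictions and ground truth in xyxy pixel coordinates.
preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 170.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([0]),  # label ids here are assumptions, not the model's mapping
}]
targets = [{
    "boxes": torch.tensor([[12.0, 18.0, 108.0, 165.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision(box_format="xyxy")
metric.update(preds, targets)
print(metric.compute())  # includes map, map_50, map_75, mar_100, ...
```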
Training hyperparameters
The following hyperparameters were used during training (see the TrainingArguments sketch after this list):
- learning_rate: 1e-06
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- training_steps: 500
- mixed_precision_training: Native AMP
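These settings map onto transformers TrainingArguments roughly as follows; the output directory and the fp16 flag (for Native AMP) are assumptions, and weight_decay is taken from the Training procedure section above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetuned-ViT-model",   # assumed output directory
    learning_rate=1e-6,
    weight_decay=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    max_steps=500,
    fp16=True,  # corresponds to "Native AMP" mixed-precision training
)
```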
Framework versions
| Property | Details |
|----------|---------|
| Transformers | 4.50.1 |
| Pytorch | 2.5.1+cu121 |
| Datasets | 3.4.1 |
| Tokenizers | 0.21.0 |
License
This model is released under the MIT license.