🚀 BERT-tiny model finetuned with M-FAC
This project presents a BERT-tiny model finetuned on the SST-2 dataset with M-FAC, a state-of-the-art second-order optimizer, as an efficient alternative to the default Adam baseline for text classification.
🚀 Quick Start
Prerequisites
To reproduce the results, you need to set up the environment as described in https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification.
Installation
You can install the necessary dependencies by following the instructions in the above repository.
Usage
To finetune and evaluate the model, use the following bash command:
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
--seed 42 \
--model_name_or_path prajjwal1/bert-tiny \
--task_name sst2 \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--output_dir out_dir/ \
--optim MFAC \
--optim_args '{"lr": 1e-4, "num_grads": 1024, "damp": 1e-6}'
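Once training completes, the finetuned checkpoint in out_dir/ (the --output_dir above) can be loaded for inference with the standard transformers pipeline API. The snippet below is a minimal sketch; the example sentence is illustrative only.

```python
# Minimal inference sketch for the checkpoint produced by the command above.
from transformers import pipeline

# "out_dir/" is the --output_dir used during finetuning.
classifier = pipeline("text-classification", model="out_dir/")

# Illustrative example sentence; the output is a list of {label, score} dicts.
print(classifier("a charming and often affecting journey"))
```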
✨ Features
- Advanced Optimizer: Uses M-FAC, a state-of-the-art matrix-free second-order optimizer that can outperform first-order optimizers such as Adam.
- Fair Comparison: The model is finetuned in the same framework as the default Adam baseline, ensuring a fair comparison.
- Reproducible Results: The detailed hyperparameters and running scripts are provided, allowing users to reproduce the results.
📦 Installation
The installation process is based on the framework described in https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification. The only change needed is to swap the default Adam optimizer for M-FAC.
Hyperparameters for M-FAC
- learning rate = 1e-4
- number of gradients = 1024
- dampening = 1e-6
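If you are running your own training loop rather than the patched run_glue.py shown above, the same hyperparameters can be wired into a Hugging Face Trainer. The sketch below is only an illustration under assumptions: the MFAC import path is hypothetical and should point at whichever module provides the optimizer class in your environment (for example, the authors' reference implementation).

```python
# Sketch only: hook an M-FAC optimizer into the Hugging Face Trainer.
# The import path below is hypothetical -- point it at whichever module
# provides the M-FAC optimizer class in your setup.
from mfac_optim import MFAC  # hypothetical module name

from transformers import Trainer


class MFACTrainer(Trainer):
    def create_optimizer(self):
        # Reuse the hyperparameters listed above.
        if self.optimizer is None:
            self.optimizer = MFAC(
                self.model.parameters(),
                lr=1e-4,         # learning rate
                num_grads=1024,  # number of gradients kept for the second-order estimate
                damp=1e-6,       # dampening
            )
        return self.optimizer
```

Instead of subclassing, a pre-built optimizer can also be passed to Trainer via its optimizers=(optimizer, lr_scheduler) argument.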
💻 Usage Examples
Basic Usage
To reproduce the results, run the same bash command shown in the Quick Start section above; it already sets the seed and the M-FAC hyperparameters listed in the Installation section.
Advanced Usage
You can adjust hyperparameters such as `per_device_train_batch_size`, `learning_rate`, `num_train_epochs`, `num_grads`, and `damp` to potentially improve performance.
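For example, a small grid over num_grads and damp can be run by re-invoking run_glue.py with different --optim_args values. The sketch below is illustrative only: the grid values are assumptions, not the settings used for the reported results.

```python
# Sketch: sweep a few M-FAC hyperparameters by re-running run_glue.py.
import json
import subprocess

for num_grads in (512, 1024):      # illustrative values, not from the original runs
    for damp in (1e-6, 1e-5):      # illustrative values, not from the original runs
        optim_args = json.dumps({"lr": 1e-4, "num_grads": num_grads, "damp": damp})
        subprocess.run(
            [
                "python", "run_glue.py",
                "--seed", "42",
                "--model_name_or_path", "prajjwal1/bert-tiny",
                "--task_name", "sst2",
                "--do_train", "--do_eval",
                "--max_seq_length", "128",
                "--per_device_train_batch_size", "32",
                "--learning_rate", "1e-4",
                "--num_train_epochs", "3",
                "--output_dir", f"out_dir/num_grads{num_grads}_damp{damp}/",
                "--optim", "MFAC",
                "--optim_args", optim_args,
            ],
            check=True,
        )
```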
📚 Documentation
For more details on the M-FAC optimizer, please check the NeurIPS 2021 paper: https://arxiv.org/pdf/2107.03356.pdf.
🔧 Technical Details
The model is finetuned on the SST-2 dataset. For fair comparison, it follows the same framework as the default Adam baseline, only replacing the Adam optimizer with M-FAC.
📄 License
No license information is provided.
Results
Best Model Score
We share the best model out of 5 runs, with the following score on the SST-2 validation set:
accuracy = 83.02
Mean and Standard Deviation
| Optimizer | SST-2 validation accuracy (mean ± std over 5 runs) |
|-----------|-----------------------------------------------------|
| Adam | 80.11 ± 0.65 |
| M-FAC | 81.86 ± 0.76 |
BibTeX Entry
@article{frantar2021m,
title={M-FAC: Efficient Matrix-Free Approximations of Second-Order Information},
author={Frantar, Elias and Kurtic, Eldar and Alistarh, Dan},
journal={Advances in Neural Information Processing Systems},
volume={35},
year={2021}
}