DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
DeBERTaV3 improves the efficiency of DeBERTa through ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing, significantly boosting performance on downstream NLU tasks.
Quick Start
If you're interested in the implementation details and updates of DeBERTaV3, please visit the official repository.
Features
- Improved Architecture: DeBERTa improves on BERT and RoBERTa with disentangled attention and an enhanced mask decoder. With 80GB of training data, it outperforms RoBERTa on a majority of NLU tasks.
- Efficient Pre-training: DeBERTa V3 further improves training efficiency using ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing, leading to better performance on downstream tasks than DeBERTa.
Installation
No dedicated installation steps are provided in the original document; the fine-tuning example below relies on the Hugging Face transformers example scripts and the datasets library.
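As a rough sketch, the examples in this README assume a working PyTorch environment with the transformers, datasets, and sentencepiece packages installed; once those are available, the pretrained checkpoint can be loaded through the standard Hugging Face Auto classes:

```python
# Minimal sketch (not from the original document): assumes
#   pip install torch transformers sentencepiece
# and downloads the public microsoft/deberta-v3-large checkpoint from the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModel.from_pretrained("microsoft/deberta-v3-large")

# Quick smoke test: encode a sentence and run a single forward pass.
inputs = tokenizer("DeBERTaV3 uses ELECTRA-style pre-training.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, sequence_length, 1024] for the large model
```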
Usage Examples
Fine-tuning on NLU tasks
We present the development results on SQuAD 2.0 and MNLI tasks:
| Model | Vocabulary (K) | Backbone #Params (M) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (ACC) |
|---|---|---|---|---|
| RoBERTa-large | 50 | 304 | 89.4/86.5 | 90.2 |
| XLNet-large | 32 | - | 90.6/87.9 | 90.8 |
| DeBERTa-large | 50 | - | 90.7/88.0 | 91.3 |
| DeBERTa-v3-large | 128 | 304 | 91.5/89.0 | 91.8/91.9 |
Fine-tuning with HF transformers
#!/bin/bash
# Fine-tune microsoft/deberta-v3-large on MNLI with the HF run_glue.py example script.
cd transformers/examples/pytorch/text-classification/
pip install datasets

export TASK_NAME=mnli
output_dir="ds_results"
num_gpus=8
batch_size=8

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-large \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 50 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 6e-6 \
  --num_train_epochs 2 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
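After the run above completes, the checkpoint written to the output directory can be reloaded for inference. The following is a hedged sketch (the ds_results path and the example sentence pair are only illustrative; the index-to-label mapping should be read from the saved config rather than hard-coded):

```python
# Sketch (not from the original document): reload the MNLI checkpoint that the
# script above writes to ds_results/ and score a premise/hypothesis pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "ds_results"  # output_dir used by the fine-tuning script above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
pred = int(probs.argmax())
# The label names come from the config saved during fine-tuning; verify them for your run.
print(pred, model.config.id2label.get(pred, "unknown"), probs.tolist())
```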
Documentation
The DeBERTa V3 large model consists of 24 layers with a hidden size of 1024. It has 304M backbone parameters and a vocabulary of 128K tokens, which introduces 131M parameters in the embedding layer. The model was trained on the same 160GB of data used for DeBERTa V2.
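These sizes can be sanity-checked against the published checkpoint. A small sketch (assuming torch and transformers are installed; the vocabulary size reported by the config may differ slightly from the rounded 128K figure):

```python
# Sketch (not from the original document): inspect the published config and count parameters.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("microsoft/deberta-v3-large")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)
# Expected per the description above: 24 layers, hidden size 1024, ~128K vocabulary.

model = AutoModel.from_pretrained("microsoft/deberta-v3-large")
embedding_params = model.get_input_embeddings().weight.numel()
total_params = sum(p.numel() for p in model.parameters())
print(f"embedding parameters: {embedding_params / 1e6:.0f}M")                   # ~131M (vocab x hidden)
print(f"backbone parameters:  {(total_params - embedding_params) / 1e6:.0f}M")  # ~304M
```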
Technical Details
You can find more technical details about the new model in our paper.
License
This project is licensed under the MIT license.
Citation
If you find DeBERTa useful for your work, please cite the following papers:
@misc{he2021debertav3,
title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
year={2021},
eprint={2111.09543},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}