DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
DeBERTaV3 enhances the efficiency of DeBERTa by leveraging ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing, significantly improving model performance on downstream tasks.
DeBERTa enhances the BERT and RoBERTa models through disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks using 80GB of training data.
In DeBERTa V3, we further boost the efficiency of DeBERTa by using ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing. Compared to DeBERTa, our V3 version notably improves the model's performance on downstream tasks. You can find more technical details about the new model in our paper.
Please refer to the official repository for more implementation details and updates.
mDeBERTa is the multilingual version of DeBERTa, which has the same structure as DeBERTa and was trained with CC100 multilingual data. The mDeBERTa V3 base model has 12 layers and a hidden size of 768. It has 86M backbone parameters and a vocabulary of 250K tokens, which introduces 190M parameters in the embedding layer. This model was trained on 2.5TB of CC100 data, as XLM-R was.
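These figures can be checked against the published checkpoint. The snippet below is a minimal sketch (not part of the original card) that loads the released configuration and tokenizer with the Hugging Face transformers library; the exact attribute values should be verified against the config shipped with the model.

```python
# Minimal sketch: inspect the released mDeBERTa V3 base configuration.
# Assumes the `transformers` and `sentencepiece` packages are installed.
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/mdeberta-v3-base"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("layers:", config.num_hidden_layers)   # expected: 12
print("hidden size:", config.hidden_size)    # expected: 768
print("vocabulary:", config.vocab_size)      # expected: roughly 250K tokens
print("tokenizer size:", len(tokenizer))     # should be consistent with the vocabulary size
```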
Features
- Enhanced Performance: DeBERTaV3 significantly improves the performance on downstream NLU tasks compared to DeBERTa.
- Multilingual Support: mDeBERTa provides multilingual capabilities, trained with CC100 multilingual data.
Installation
No dedicated installation steps are provided; the examples below only assume a working installation of the Hugging Face `transformers` and `datasets` packages (e.g. `pip install transformers datasets`).
Usage Examples
Fine-tuning on NLU tasks
We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e., the model is fine-tuned on English data only and evaluated on the other languages.
| Property | Details |
|----------|---------|
| Model Type | mDeBERTa-base |
| Training Data | English data for training, tested on multiple languages (en, fr, es, de, el, bg, ru, tr, ar, vi, th, zh, hi, sw, ur) |
| Model | avg | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|-------|-----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| XLM-R-base | 76.2 | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 |
| mDeBERTa-base | 79.8±0.2 | 88.2 | 82.6 | 84.4 | 82.7 | 82.3 | 82.4 | 80.8 | 79.5 | 78.5 | 78.1 | 76.4 | 79.5 | 75.9 | 73.9 | 72.4 |
Fine-tuning with HF transformers
```bash
#!/bin/bash
# Fine-tune mDeBERTa V3 base on XNLI with English training data only
# (zero-shot cross-lingual transfer). Run from a directory containing a
# checkout of the transformers repository; TASK_NAME is expected to be
# set in the environment before launching.
cd transformers/examples/pytorch/text-classification/
pip install datasets

output_dir="ds_results"
num_gpus=8
batch_size=4

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_xnli.py \
  --model_name_or_path microsoft/mdeberta-v3-base \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --train_language en \
  --language en \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 3000 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 2e-5 \
  --num_train_epochs 6 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
```
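After fine-tuning, the checkpoint written to `output_dir` can be used for zero-shot cross-lingual inference, i.e., classifying premise/hypothesis pairs in languages that were never seen during fine-tuning. The snippet below is a minimal sketch under the assumption that the run above saved a usable checkpoint to `ds_results`; the label order shown is the usual XNLI convention (entailment, neutral, contradiction) and should be double-checked against `model.config.id2label`.

```python
# Minimal inference sketch (assumes the fine-tuning run above produced ./ds_results).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "ds_results"  # output_dir from the fine-tuning script above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# A French premise/hypothesis pair, even though fine-tuning used English data only
# (zero-shot cross-lingual transfer).
premise = "Le modèle a été entraîné uniquement sur des données en anglais."
hypothesis = "Le modèle a vu des données d'entraînement en anglais."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Typical XNLI label order; verify against model.config.id2label.
labels = ["entailment", "neutral", "contradiction"]
print(labels[logits.argmax(dim=-1).item()])
```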
Documentation
More technical details about the model, its pre-training, and the fine-tuning results above are available in the DeBERTaV3 paper and the official repository.
License
This project is licensed under the MIT license.
Citation
If you find DeBERTa useful for your work, please cite the following papers:
```bibtex
@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{he2021deberta,
      title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
      author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
      booktitle={International Conference on Learning Representations},
      year={2021},
      url={https://openreview.net/forum?id=XPZIaotutsD}
}
```