🚀 DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves on the BERT and RoBERTa models by using disentangled attention and an enhanced mask decoder. With 80GB of training data, it outperforms BERT and RoBERTa on a majority of NLU tasks.
For more details and updates, please visit the official repository. This checkpoint is the DeBERTa V2 xxlarge model, with 48 layers and a hidden size of 1536. It has 1.5B parameters in total and was trained on 160GB of raw data.
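As a quick sanity check, the depth and hidden size quoted above can be read from the published configuration. This is a minimal sketch using the Hugging Face transformers library (assumes `transformers` is installed); the expected values in the comments come from this card, not from running the code here.

```python
# Minimal sketch: read the DeBERTa-V2-XXLarge configuration from the Hub
# and confirm the architecture figures quoted above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/deberta-v2-xxlarge")
print(config.num_hidden_layers)  # expected: 48
print(config.hidden_size)        # expected: 1536
```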
🚀 Quick Start
This section provides an overview of the DeBERTa model and its performance on various NLU tasks. You can find detailed installation and usage instructions below.
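For a minimal end-to-end check before fine-tuning, the sketch below loads the checkpoint for feature extraction with the Hugging Face transformers library. It assumes `transformers`, `torch`, and `sentencepiece` are installed; note that the 1.5B-parameter checkpoint needs a correspondingly large amount of memory.

```python
# Minimal feature-extraction sketch: encode one sentence and inspect the
# final hidden states. Fine-tuning recipes are given in the sections below.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xxlarge")
model = AutoModel.from_pretrained("microsoft/deberta-v2-xxlarge")

inputs = tokenizer("DeBERTa uses disentangled attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 1536)
```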
⨠Features
- Disentangled Attention: each token is represented by two vectors encoding its content and its relative position, and attention weights are computed from both rather than from a single summed embedding (a toy sketch of the score computation follows this list).
- Enhanced Mask Decoder: absolute position information is incorporated in the decoding layer to improve the prediction of masked tokens.
- High Performance: Outperforms BERT and RoBERTa on most NLU tasks.
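To make the first feature concrete, here is a toy, single-head sketch of how the disentangled attention score decomposes into content-to-content, content-to-position, and position-to-content terms. It illustrates the idea from the paper, not the library's implementation; the function and parameter names (`Hc`, `rel_emb`, `Wq_c`, etc.) and the single-head simplification are assumptions made for brevity.

```python
# Toy sketch of DeBERTa's disentangled attention scores (single head).
# Hc: (n, d) content hidden states; rel_emb: (2k, d) relative-position
# embeddings; W*_c / W*_r: (d, d) content / position projection matrices.
import torch

def disentangled_scores(Hc, rel_emb, Wq_c, Wk_c, Wq_r, Wk_r, k):
    Qc, Kc = Hc @ Wq_c, Hc @ Wk_c            # content queries / keys
    Qr, Kr = rel_emb @ Wq_r, rel_emb @ Wk_r  # relative-position queries / keys
    n, d = Hc.shape
    # delta[i, j] = clip(i - j + k, 0, 2k - 1): bucketed relative distance
    idx = torch.arange(n)
    delta = torch.clamp(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)
    c2c = Qc @ Kc.T                            # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, delta)    # content-to-position
    p2c = torch.gather(Kc @ Qr.T, 1, delta).T  # position-to-content
    return (c2c + c2p + p2c) / (3 * d) ** 0.5  # scaled attention scores
```

A row-wise softmax over these scores would give the attention weights, exactly as in standard self-attention.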
📦 Installation
To run the DeBERTa model, install the necessary dependencies. For training you can use either DeepSpeed or the Hugging Face Trainer's `--sharded_ddp` option.
Install with DeepSpeed
```bash
pip install datasets
pip install deepspeed

# Download the DeepSpeed config file
wget https://huggingface.co/microsoft/deberta-v2-xxlarge/resolve/main/ds_config.json -O ds_config.json
```
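If you would rather inspect or hand-edit the configuration than download it, the sketch below writes an illustrative minimal DeepSpeed config (fp16 plus ZeRO stage 2). The keys shown are standard DeepSpeed options, but the values and the file name `ds_config_example.json` are assumptions for illustration; the official `ds_config.json` fetched above is the recommended starting point.

```python
# Illustrative only: write a minimal DeepSpeed configuration to disk.
# Prefer the official ds_config.json downloaded above for real runs.
import json

ds_config = {
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 2},    # ZeRO stage-2 sharding
    "train_micro_batch_size_per_gpu": 8,  # matches batch_size used below
    "gradient_accumulation_steps": 1,
}

with open("ds_config_example.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```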
Install with `--sharded_ddp`
```bash
cd transformers/examples/text-classification/
```
💻 Usage Examples
Run with DeepSpeed
```bash
export TASK_NAME=mnli
output_dir="ds_results"
num_gpus=8
batch_size=8
python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 3e-6 \
  --num_train_epochs 3 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 10 \
  --logging_dir $output_dir \
  --deepspeed ds_config.json
```
Run with `--sharded_ddp`
```bash
export TASK_NAME=mnli
python -m torch.distributed.launch --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME --do_train --do_eval --max_seq_length 256 --per_device_train_batch_size 8 \
  --learning_rate 3e-6 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp --fp16
```
📚 Documentation
Fine-tuning on NLU tasks
The following table shows the dev set results on SQuAD 1.1/2.0 and several GLUE benchmark tasks (for STS-B, P/S denotes Pearson/Spearman correlation).
| Model | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (Acc) | SST-2 (Acc) | QNLI (Acc) | CoLA (MCC) | RTE (Acc) | MRPC (Acc/F1) | QQP (Acc/F1) | STS-B (P/S) |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3/- | 90.0/- |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2/- | 92.4/- |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3/- | 92.5/- |
| DeBERTa-Large^1 | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 91.0 | 92.6/94.6 | 92.3/- | 92.8/92.5 |
| DeBERTa-XLarge^1 | -/- | -/- | 91.5/91.2 | 97.0 | - | - | 93.1 | 92.1/94.3 | - | 92.9/92.7 |
| DeBERTa-V2-XLarge^1 | 95.8/90.8 | 91.4/88.9 | 91.7/91.6 | 97.5 | 95.8 | 71.1 | 93.9 | 92.0/94.2 | 92.3/89.8 | 92.9/92.9 |
| DeBERTa-V2-XXLarge^1,2 | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | 97.2 | 96.0 | 72.0 | 93.5 | 93.1/94.9 | 92.7/90.3 | 93.2/93.1 |
Notes
📄 License
No license information was provided in the original document.
🔧 Technical Details
| Property | Details |
|---|---|
| Model Type | Decoding-enhanced BERT with Disentangled Attention |
| Training Data | 80GB for the original DeBERTa models; 160GB of raw data for the DeBERTa V2 xxlarge model |
📚 Citation
If you find DeBERTa useful for your work, please cite the following paper:
```bibtex
@inproceedings{he2021deberta,
    title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
    author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
    booktitle={International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=XPZIaotutsD}
}
```
💡 Usage Tip
To try the XXLarge model with HF transformers, we recommend using DeepSpeed, as it is faster and saves memory.