V3large 2epoch
DeBERTa is an improved BERT-style model built on a disentangled attention mechanism. Pretrained on 160GB of text with up to 1.5 billion parameters, it surpasses BERT and RoBERTa on multiple natural language understanding tasks.
Release Time: 3/2/2022
Model Overview
DeBERTa improves on the BERT architecture with disentangled attention and an enhanced mask decoder, making it particularly well suited to natural language understanding tasks; it achieves excellent performance on the GLUE benchmark.
Model Features
Disentangled Attention Mechanism
Separates the attention computation over content and relative position representations, improving the model's ability to capture relationships between tokens (see the score decomposition after this feature list).
Enhanced Mask Decoder
An improved masked language modeling objective that incorporates absolute position information into the decoding layer, strengthening the model's contextual modeling.
Large-scale Pretraining
Pretrained on 160GB of raw text, with model sizes of up to 1.5 billion parameters.
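For reference, the disentangled attention score from the DeBERTa paper decomposes as follows, where H_i is the content vector of token i and P_{i|j} its relative-position vector with respect to token j (the position-to-position term is dropped):

A_{i,j} = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T

That is, the score is the sum of content-to-content, content-to-position, and position-to-content terms, each computed with its own projections.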
Model Capabilities
Text Classification
Natural Language Inference
Question Answering
Semantic Similarity
Sentence Pair Classification
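All of these capabilities can be driven through the Hugging Face transformers library. A minimal loading sketch, assuming the checkpoint is published under a Hub id such as microsoft/deberta-v3-large (the exact id of this 2-epoch build is not stated here):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model id is an assumption; substitute the actual id of this checkpoint.
model_id = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

inputs = tokenizer("DeBERTa disentangles content and position attention.", return_tensors="pt")
logits = model(**inputs).logits  # the classification head is randomly initialized until fine-tuned
print(logits.shape)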
Use Cases
Text Understanding
Multi-genre Natural Language Inference
Determine the logical relationship between a premise and a hypothesis (entailment/contradiction/neutral).
Achieves 91.7/91.9 (matched/mismatched) accuracy on the MNLI dataset.
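A hedged sketch of the inference call, assuming the publicly available MNLI fine-tune microsoft/deberta-large-mnli (not necessarily this exact checkpoint):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "microsoft/deberta-large-mnli"  # assumed MNLI fine-tune, not this exact checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Premise and hypothesis are encoded together as a sentence pair.
enc = tokenizer("A soccer game with multiple males playing.",
                "Some men are playing a sport.",
                return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(model.config.id2label[int(logits.argmax())])  # expected: ENTAILMENT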
Sentiment Analysis
Analyze text sentiment (positive/negative).
Achieves 97.2% accuracy on the SST-2 dataset.
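A minimal sketch with the transformers pipeline; the model id below is a placeholder for a DeBERTa checkpoint fine-tuned on SST-2:

from transformers import pipeline

# Placeholder id: point this at a DeBERTa model fine-tuned on SST-2.
classifier = pipeline("text-classification", model="your-org/deberta-v3-large-sst2")
print(classifier("The movie was an absolute delight."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]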
Question Answering
Machine Reading Comprehension
Answer questions based on a given passage.
Achieves 92.2 F1 / 89.7 EM on SQuAD 2.0.
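A hedged sketch using the question-answering pipeline, assuming a SQuAD 2.0 fine-tune such as the community checkpoint deepset/deberta-v3-large-squad2:

from transformers import pipeline

# Assumed community checkpoint fine-tuned on SQuAD 2.0.
qa = pipeline("question-answering", model="deepset/deberta-v3-large-squad2")
result = qa(
    question="What mechanism does DeBERTa use?",
    context="DeBERTa improves BERT with a disentangled attention mechanism "
            "that separates content and position representations.",
)
print(result["answer"])  # e.g. "a disentangled attention mechanism"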