# DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves on BERT and RoBERTa using two techniques: disentangled attention and an enhanced mask decoder. With these improvements, it outperforms BERT and RoBERTa on a majority of NLU tasks using 80GB of training data.
Please check the official repository for more details and updates.
## Quick Start
The following sections provide fine-tuning results on NLU tasks and usage notes for different models.
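As a minimal usage sketch (assuming the Hugging Face `transformers` and PyTorch packages are installed; any of the checkpoints linked in the table below can be substituted), a pretrained DeBERTa model can be loaded and run like this:

```python
# Minimal sketch: load a pretrained DeBERTa checkpoint and encode a sentence.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-large")
model = AutoModel.from_pretrained("microsoft/deberta-large")

inputs = tokenizer("DeBERTa uses disentangled attention.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```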
## Features
- Disentangled Attention: each token is represented by two vectors that encode its content and its position, and attention weights are computed from disentangled matrices on contents and relative positions (see the sketch after this list).
- Enhanced Mask Decoder: absolute position information is incorporated into the decoding layer to predict the masked tokens during pretraining.
- High Performance: Outperforms BERT and RoBERTa on most NLU tasks.
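To make the disentangled attention idea concrete, here is a rough single-head sketch in plain PyTorch (the tensor names, the helper `relative_position_index`, and the toy dimensions are ours for illustration, not the official implementation). The attention score decomposes into content-to-content, content-to-position, and position-to-content terms, scaled by √(3d):

```python
import torch

def relative_position_index(seq_len: int, k: int) -> torch.Tensor:
    # delta(i, j) = i - j, clipped to [-k, k-1] and shifted into [0, 2k)
    pos = torch.arange(seq_len)
    delta = pos[:, None] - pos[None, :]
    return delta.clamp(-k, k - 1) + k

def disentangled_scores(Hc, rel_emb, Wq, Wk, Wqr, Wkr, rel_idx):
    d = Wq.shape[1]
    Qc, Kc = Hc @ Wq, Hc @ Wk               # content query/key, (L, d)
    Qr, Kr = rel_emb @ Wqr, rel_emb @ Wkr   # position query/key, (2k, d)
    c2c = Qc @ Kc.T                              # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)    # content-to-position: Qc_i . Kr_{delta(i,j)}
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T  # position-to-content: Kc_j . Qr_{delta(j,i)}
    return (c2c + c2p + p2c) / (3 * d) ** 0.5    # scale by sqrt(3d)

# Toy usage: 8 tokens, hidden size 16, max relative distance k = 4.
L, d, k = 8, 16, 4
torch.manual_seed(0)
Hc = torch.randn(L, d)           # content hidden states
rel_emb = torch.randn(2 * k, d)  # shared relative position embeddings
Wq, Wk, Wqr, Wkr = (torch.randn(d, d) for _ in range(4))
scores = disentangled_scores(Hc, rel_emb, Wq, Wk, Wqr, Wkr, relative_position_index(L, k))
attn = scores.softmax(dim=-1)    # (L, L) attention weights
```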
## Documentation
### Fine-tuning on NLU tasks
We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
| Model | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (Acc) | SST-2 (Acc) | QNLI (Acc) | CoLA (MCC) | RTE (Acc) | MRPC (Acc/F1) | QQP (Acc/F1) | STS-B (P/S) |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3/- | 90.0/- |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2/- | 92.4/- |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3/- | 92.5/- |
| [DeBERTa-Large](https://huggingface.co/microsoft/deberta-large)<sup>1</sup> | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 91.0 | 92.6/94.6 | 92.3/- | 92.8/92.5 |
| [DeBERTa-XLarge](https://huggingface.co/microsoft/deberta-xlarge)<sup>1</sup> | -/- | -/- | 91.5/91.2 | 97.0 | - | - | 93.1 | 92.1/94.3 | - | 92.9/92.7 |
| [DeBERTa-V2-XLarge](https://huggingface.co/microsoft/deberta-v2-xlarge)<sup>1</sup> | 95.8/90.8 | 91.4/88.9 | 91.7/91.6 | 97.5 | 95.8 | 71.1 | 93.9 | 92.0/94.2 | 92.3/89.8 | 92.9/92.9 |
| [DeBERTa-V2-XXLarge](https://huggingface.co/microsoft/deberta-v2-xxlarge)<sup>1,2</sup> | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | 97.2 | 96.0 | 72.0 | 93.5 | 93.1/94.9 | 92.7/90.3 | 93.2/93.1 |
#### Notes
- <sup>1</sup> Following RoBERTa, for RTE, MRPC, and STS-B we fine-tune starting from the MNLI fine-tuned models [DeBERTa-Large-MNLI](https://huggingface.co/microsoft/deberta-large-mnli), [DeBERTa-XLarge-MNLI](https://huggingface.co/microsoft/deberta-xlarge-mnli), [DeBERTa-V2-XLarge-MNLI](https://huggingface.co/microsoft/deberta-v2-xlarge-mnli), and [DeBERTa-V2-XXLarge-MNLI](https://huggingface.co/microsoft/deberta-v2-xxlarge-mnli) (a loading sketch follows the command below). The results on SST-2/QQP/QNLI/SQuAD v2.0 would also improve slightly when starting from MNLI fine-tuned models; however, for those four tasks we only report the numbers fine-tuned from pretrained base models.
- <sup>2</sup> To try the XXLarge model with HF transformers, you need to specify `--sharded_ddp`:

```bash
cd transformers/examples/text-classification/
export TASK_NAME=mrpc
python -m torch.distributed.launch --nproc_per_node=8 run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 4 \
  --learning_rate 3e-6 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir --sharded_ddp --fp16
```
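As a companion to note 1, here is a minimal sketch (again assuming the `transformers` library; the example premise/hypothesis pair is ours) of loading an MNLI fine-tuned checkpoint and running an NLI-style prediction:

```python
# Sketch: run an NLI prediction with an MNLI fine-tuned DeBERTa checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Premise/hypothesis pair; the tokenizer joins them with the separator token.
inputs = tokenizer("A man is playing guitar.", "A person makes music.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # label names come from the model's own config
```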
## Citation
If you find DeBERTa useful for your work, please cite the following paper:
```bibtex
@inproceedings{he2021deberta,
  title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
  author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=XPZIaotutsD}
}
```
## License
This project is licensed under the MIT license.