DeBERTa V2 XLarge
DeBERTa (Decoding-enhanced BERT with disentangled Attention) improves on BERT with a disentangled attention mechanism and an enhanced mask decoder, surpassing both BERT and RoBERTa on a range of natural language understanding tasks.
Release Date: 3/2/2022
Model Overview
DeBERTa is an improved BERT model that boosts performance on natural language understanding tasks through a disentangled attention mechanism and an enhanced mask decoder. This V2 XLarge variant was pretrained on 160GB of data and uses a 24-layer architecture with a hidden size of 1536, totaling roughly 900 million parameters.
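These architecture details can be confirmed directly from the published configuration. A quick check using the Hugging Face transformers library, assuming the microsoft/deberta-v2-xlarge checkpoint:

```python
from transformers import AutoConfig

# Load the published configuration for the checkpoint
config = AutoConfig.from_pretrained("microsoft/deberta-v2-xlarge")
print(config.num_hidden_layers)  # 24
print(config.hidden_size)        # 1536
```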
Model Features
Disentangled Attention Mechanism
Captures dependencies in text by computing attention from separate content and relative-position representations rather than a single combined embedding.
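Concretely, each attention score is the sum of content-to-content, content-to-position, and position-to-content terms. Below is a minimal single-head sketch of the idea in PyTorch; the per-pair relative-position tensor P_rel and the weight matrices are illustrative placeholders, and the actual DeBERTa implementation is more memory-efficient (it indexes a shared table of relative-position embeddings instead of materializing an (n, n, d) tensor):

```python
import math
import torch

def disentangled_attention_scores(H, P_rel, W_q, W_k, W_qr, W_kr):
    """Single-head disentangled attention scores (simplified sketch).

    H:     (n, d) content hidden states
    P_rel: (n, n, d) relative-position embedding for each (i, j) pair
    """
    Qc = H @ W_q                    # content queries, (n, d)
    Kc = H @ W_k                    # content keys,    (n, d)
    c2c = Qc @ Kc.T                 # content-to-content scores, (n, n)
    Kr = P_rel @ W_kr               # relative-position keys, (n, n, d)
    # content-to-position: query content attends to the key's relative position
    c2p = torch.einsum("id,ijd->ij", Qc, Kr)
    Qr = P_rel @ W_qr               # relative-position queries, (n, n, d)
    # position-to-content: key content attends to the reversed relative position
    p2c = torch.einsum("jd,ijd->ij", Kc, Qr.transpose(0, 1))
    # scale by sqrt(3d) since the score is a sum of three dot products
    return (c2c + c2p + p2c) / math.sqrt(3 * H.size(-1))

n, d = 6, 16
H = torch.randn(n, d)
P_rel = torch.randn(n, n, d)
W_q, W_k, W_qr, W_kr = (torch.randn(d, d) for _ in range(4))
attn = disentangled_attention_scores(H, P_rel, W_q, W_k, W_qr, W_kr).softmax(-1)
```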
Enhanced Masked Decoder
Incorporates absolute position information in the decoding layer when predicting masked tokens, improving on standard masked language modeling and strengthening contextual understanding.
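Because the checkpoint was pretrained with masked language modeling, it can fill in masked tokens without any fine-tuning. A minimal sketch with the transformers library, assuming the microsoft/deberta-v2-xlarge checkpoint (the example sentence is illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
model = AutoModelForMaskedLM.from_pretrained("microsoft/deberta-v2-xlarge")

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and take its top predictions
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```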
Large-scale Pretraining
Trained on 160GB of raw data, providing robust language representation capabilities.
Model Capabilities
Text Understanding
Question Answering Systems
Text Classification
Natural Language Inference
Semantic Similarity Calculation
Use Cases
Natural Language Processing
Question Answering System
Build high-performance question answering systems, e.g., for SQuAD-style extractive QA.
Achieved F1/EM scores of 91.4/89.7 on SQuAD 2.0.
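A sketch of serving such a system with the transformers pipeline API; the model name below is a hypothetical placeholder for a DeBERTa V2 XLarge checkpoint that has actually been fine-tuned on SQuAD 2.0 (the base checkpoint ships without a trained QA head):

```python
from transformers import pipeline

# "your-org/deberta-v2-xlarge-squad2" is a hypothetical placeholder;
# substitute a checkpoint fine-tuned on SQuAD 2.0
qa = pipeline("question-answering", model="your-org/deberta-v2-xlarge-squad2")
result = qa(
    question="What mechanism does DeBERTa use?",
    context="DeBERTa improves on BERT using a disentangled attention "
            "mechanism and an enhanced mask decoder.",
)
print(result["answer"], result["score"])
```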
Text Classification
Used for text classification tasks like sentiment analysis.
Achieved 97.5% accuracy on the SST-2 sentiment analysis task.
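A similar pipeline sketch for sentiment classification; again the model name is a hypothetical placeholder for a checkpoint fine-tuned on SST-2:

```python
from transformers import pipeline

# hypothetical placeholder; substitute a checkpoint fine-tuned on SST-2
classifier = pipeline("text-classification", model="your-org/deberta-v2-xlarge-sst2")
print(classifier("A touching and beautifully made film."))
```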
Natural Language Inference
Determines the logical relationship between two pieces of text.
Achieved 91.7/91.9 accuracy on the MNLI matched/mismatched sets.
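For NLI-style fine-tuning, premise and hypothesis are encoded as a sentence pair. A sketch of preparing the model for MNLI-style training; the three-way classification head below is randomly initialized and must be fine-tuned before its outputs are meaningful:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
# num_labels=3 for entailment / neutral / contradiction; this head is
# untrained until fine-tuned on MNLI
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v2-xlarge", num_labels=3
)
inputs = tokenizer(
    "A man is playing a guitar on stage.",   # premise
    "Someone is performing music.",          # hypothesis
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits          # shape (1, 3)
```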