DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
DeBERTaV3 improves the efficiency of DeBERTa through ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing, significantly boosting performance on downstream NLU tasks.
Quick Start
If you're interested in the implementation details and updates of DeBERTaV3, please visit the official repository.
Features
- Improved Architecture: DeBERTa improves on BERT and RoBERTa with disentangled attention and an enhanced mask decoder. With 80GB of training data, it outperforms RoBERTa on a majority of NLU tasks.
- Efficient Pre-training: DeBERTa V3 further improves training efficiency using ELECTRA-style pre-training with Gradient-Disentangled Embedding Sharing, leading to better performance on downstream tasks than DeBERTa.
Installation
No dedicated installation steps are provided in the original document; the fine-tuning example below relies on the Hugging Face transformers example scripts and the datasets library.
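As a rough sketch, the examples in this README assume a working PyTorch environment with the transformers, datasets, and sentencepiece packages installed; once those are available, the pretrained checkpoint can be loaded through the standard Hugging Face Auto classes:

```python
# Minimal sketch (not from the original document): assumes
#   pip install torch transformers sentencepiece
# and downloads the public microsoft/deberta-v3-large checkpoint from the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModel.from_pretrained("microsoft/deberta-v3-large")

# Quick smoke test: encode a sentence and run a single forward pass.
inputs = tokenizer("DeBERTaV3 uses ELECTRA-style pre-training.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, sequence_length, 1024] for the large model
```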
Usage Examples
Fine-tuning on NLU tasks
We present the development results on SQuAD 2.0 and MNLI tasks:
| Model | Vocabulary (K) | Backbone #Params (M) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (ACC) |
|---|---|---|---|---|
| RoBERTa-large | 50 | 304 | 89.4/86.5 | 90.2 |
| XLNet-large | 32 | - | 90.6/87.9 | 90.8 |
| DeBERTa-large | 50 | - | 90.7/88.0 | 91.3 |
| DeBERTa-v3-large | 128 | 304 | 91.5/89.0 | 91.8/91.9 |
Fine-tuning with HF transformers
#!/bin/bash
# Fine-tune microsoft/deberta-v3-large on MNLI with the HF run_glue.py example script.
cd transformers/examples/pytorch/text-classification/
pip install datasets

export TASK_NAME=mnli
output_dir="ds_results"
num_gpus=8
batch_size=8

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_glue.py \
  --model_name_or_path microsoft/deberta-v3-large \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 50 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 6e-6 \
  --num_train_epochs 2 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
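After the run above completes, the checkpoint written to the output directory can be reloaded for inference. The following is a hedged sketch (the ds_results path and the example sentence pair are only illustrative; the index-to-label mapping should be read from the saved config rather than hard-coded):

```python
# Sketch (not from the original document): reload the MNLI checkpoint that the
# script above writes to ds_results/ and score a premise/hypothesis pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "ds_results"  # output_dir used by the fine-tuning script above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
pred = int(probs.argmax())
# The label names come from the config saved during fine-tuning; verify them for your run.
print(pred, model.config.id2label.get(pred, "unknown"), probs.tolist())
```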
Documentation
The DeBERTa V3 large model consists of 24 layers with a hidden size of 1024. It has 304M backbone parameters and a vocabulary of 128K tokens, which introduces 131M parameters in the embedding layer. The model was trained on the same 160GB of data used for DeBERTa V2.
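These sizes can be sanity-checked against the published checkpoint. A small sketch (assuming torch and transformers are installed; the vocabulary size reported by the config may differ slightly from the rounded 128K figure):

```python
# Sketch (not from the original document): inspect the published config and count parameters.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("microsoft/deberta-v3-large")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)
# Expected per the description above: 24 layers, hidden size 1024, ~128K vocabulary.

model = AutoModel.from_pretrained("microsoft/deberta-v3-large")
embedding_params = model.get_input_embeddings().weight.numel()
total_params = sum(p.numel() for p in model.parameters())
print(f"embedding parameters: {embedding_params / 1e6:.0f}M")                   # ~131M (vocab x hidden)
print(f"backbone parameters:  {(total_params - embedding_params) / 1e6:.0f}M")  # ~304M
```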
Technical Details
You can find more technical details about the new model in our paper.
License
This project is licensed under the MIT license.
Citation
If you find DeBERTa useful for your work, please cite the following papers:
@misc{he2021debertav3,
title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
year={2021},
eprint={2111.09543},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}