ZSD-microsoft-v2xxlmnli
A decoding-enhanced BERT model with disentangled attention (DeBERTa), in its V2 xxlarge variant fine-tuned on the MNLI task.
Downloads: 59
Release Time: 3/2/2022
Model Overview
DeBERTa improves on the BERT architecture with an innovative disentangled attention mechanism and an enhanced mask decoder, achieving SOTA performance on multiple natural language understanding tasks. This version is specifically fine-tuned for the MNLI (Multi-Genre Natural Language Inference) task.
Model Features
Disentangled Attention Mechanism
Represents each token with separate content and position vectors and computes attention from content-to-content, content-to-position, and position-to-content terms, significantly improving the model's grasp of complex language structure (see the sketch after this list).
Enhanced Mask Decoder
An improved masked language modeling component that injects absolute position information when predicting masked tokens, better capturing the dependencies between words.
Cross-task Transfer Ability
After fine-tuning on MNLI, the checkpoint transfers directly to related tasks such as RTE, MRPC, and STS-B.
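As a rough illustration of the disentangled attention idea, here is a minimal single-head PyTorch sketch of the three-term attention score from the DeBERTa paper. It uses toy sizes and random weights; the actual implementation adds multiple heads, bucketed relative positions, and learned per-layer projections.

import torch

torch.manual_seed(0)
seq_len, d = 4, 8                       # toy sequence length and hidden size
H = torch.randn(seq_len, d)             # content embedding per token
P = torch.randn(2 * seq_len, d)         # relative-position embeddings

# Separate projections for content and position (the "disentangled" part).
Wq_c, Wk_c = torch.randn(d, d), torch.randn(d, d)   # content query/key
Wq_r, Wk_r = torch.randn(d, d), torch.randn(d, d)   # position query/key

Qc, Kc = H @ Wq_c, H @ Wk_c             # content queries and keys
Qr, Kr = P @ Wq_r, P @ Wk_r             # relative-position queries and keys

# delta[i, j]: relative distance between tokens i and j, shifted into
# the index range of P.
idx = torch.arange(seq_len)
delta = (idx[None, :] - idx[:, None]) + seq_len

# The three disentangled terms of the raw attention score:
c2c = Qc @ Kc.T                            # content-to-content
c2p = torch.gather(Qc @ Kr.T, 1, delta)    # content-to-position
p2c = torch.gather(Kc @ Qr.T, 1, delta).T  # position-to-content

attn = torch.softmax((c2c + c2p + p2c) / (3 * d) ** 0.5, dim=-1)
print(attn.shape)  # (seq_len, seq_len)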
Model Capabilities
Natural Language Inference
Text Classification
Semantic Similarity Calculation
Zero-shot Classification
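The zero-shot classification capability comes directly from the MNLI head: each candidate label is turned into an entailment hypothesis and scored against the input. A minimal sketch with the Hugging Face transformers pipeline, assuming the checkpoint is published under the model id NDugar/ZSD-microsoft-v2xxlmnli (the example text and labels are made up):

from transformers import pipeline

# Zero-shot classification via NLI: labels become entailment hypotheses.
classifier = pipeline(
    "zero-shot-classification",
    model="NDugar/ZSD-microsoft-v2xxlmnli",  # assumed Hugging Face model id
)

result = classifier(
    "The new graphics card doubles the throughput of its predecessor.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label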
Use Cases
Text Understanding
Multi-genre Text Inference
Determine the logical relationship (entailment/contradiction/neutral) between two texts.
Achieved 91.7/91.9 accuracy (matched/mismatched) on the MNLI test set.
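A minimal sketch of scoring a premise/hypothesis pair with the NLI head, again assuming the NDugar/ZSD-microsoft-v2xxlmnli model id; note that the order of the three labels is defined by the checkpoint's config, so the code reads it from model.config.id2label rather than hard-coding it:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "NDugar/ZSD-microsoft-v2xxlmnli"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Encode the pair and score it with the 3-way MNLI classification head.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1).squeeze()
for label_id, p in enumerate(probs.tolist()):
    print(model.config.id2label[label_id], round(p, 3))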
Semantic Similarity Analysis
Evaluate the semantic similarity between sentence pairs.
Achieved a Pearson correlation coefficient of 93.2 on the STS-B dataset.
Transfer Learning
Few-shot Task Adaptation
Quickly adapts to related inference tasks such as RTE by initializing from the MNLI fine-tuned checkpoint (see the sketch below).
Achieved 93.5 accuracy on the RTE task.
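A sketch of this transfer recipe for RTE, assuming the same model id and the GLUE RTE split from the datasets library. The hyperparameters are illustrative placeholders, not the settings behind the reported 93.5; since RTE is binary, the 3-way MNLI head is swapped for a freshly initialized 2-way head.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "NDugar/ZSD-microsoft-v2xxlmnli"  # assumed id; MNLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Replace the 3-way MNLI head with a new 2-way head for binary RTE.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, ignore_mismatched_sizes=True)

rte = load_dataset("glue", "rte")

def encode(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=256)

rte = rte.map(encode, batched=True)

args = TrainingArguments(output_dir="rte-out",          # illustrative settings
                         per_device_train_batch_size=4,
                         learning_rate=1e-5,
                         num_train_epochs=3)
Trainer(model=model, args=args,
        train_dataset=rte["train"],
        eval_dataset=rte["validation"],
        tokenizer=tokenizer).train()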