DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
DeBERTaV3 enhances the efficiency of DeBERTa by leveraging ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing, significantly improving model performance on downstream tasks.
DeBERTa enhances the BERT and RoBERTa models through disentangled attention and an enhanced mask decoder. With these two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks using 80GB of training data.
In DeBERTa V3, we further boost the efficiency of DeBERTa by using ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing. Compared to DeBERTa, our V3 version notably improves the model's performance on downstream tasks. You can find more technical details about the new model in our paper.
Please refer to the official repository for more implementation details and updates.
mDeBERTa is the multilingual version of DeBERTa, which has the same structure as DeBERTa and was trained with CC100 multilingual data. The mDeBERTa V3 base model has 12 layers and a hidden size of 768. It has 86M backbone parameters and a vocabulary of 250K tokens, which introduces 190M parameters in the embedding layer. This model was trained on 2.5TB of CC100 data, as XLM-R was.
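These figures can be checked against the published checkpoint. The snippet below is a minimal sketch (not part of the original card) that loads the released configuration and tokenizer with the Hugging Face transformers library; the exact attribute values should be verified against the config shipped with the model.

```python
# Minimal sketch: inspect the released mDeBERTa V3 base configuration.
# Assumes the `transformers` and `sentencepiece` packages are installed.
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/mdeberta-v3-base"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("layers:", config.num_hidden_layers)   # expected: 12
print("hidden size:", config.hidden_size)    # expected: 768
print("vocabulary:", config.vocab_size)      # expected: roughly 250K tokens
print("tokenizer size:", len(tokenizer))     # should be consistent with the vocabulary size
```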
Features
- Enhanced Performance: DeBERTaV3 significantly improves the performance on downstream NLU tasks compared to DeBERTa.
- Multilingual Support: mDeBERTa provides multilingual capabilities, trained with CC100 multilingual data.
Installation
No dedicated installation steps are provided; the examples below only assume a working installation of the Hugging Face `transformers` and `datasets` packages (e.g. `pip install transformers datasets`).
Usage Examples
Fine-tuning on NLU tasks
We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e., the model is fine-tuned on English data only and evaluated on the other languages.
| Property | Details |
|----------|---------|
| Model Type | mDeBERTa-base |
| Training Data | English data for training, tested on multiple languages (en, fr, es, de, el, bg, ru, tr, ar, vi, th, zh, hi, sw, ur) |
| Model | avg | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|-------|-----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| XLM-R-base | 76.2 | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 |
| mDeBERTa-base | 79.8±0.2 | 88.2 | 82.6 | 84.4 | 82.7 | 82.3 | 82.4 | 80.8 | 79.5 | 78.5 | 78.1 | 76.4 | 79.5 | 75.9 | 73.9 | 72.4 |
Fine-tuning with HF transformers
```bash
#!/bin/bash
# Fine-tune mDeBERTa V3 base on XNLI with English training data only
# (zero-shot cross-lingual transfer). Run from a directory containing a
# checkout of the transformers repository; TASK_NAME is expected to be
# set in the environment before launching.
cd transformers/examples/pytorch/text-classification/
pip install datasets

output_dir="ds_results"
num_gpus=8
batch_size=4

python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
  run_xnli.py \
  --model_name_or_path microsoft/mdeberta-v3-base \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --train_language en \
  --language en \
  --evaluation_strategy steps \
  --max_seq_length 256 \
  --warmup_steps 3000 \
  --per_device_train_batch_size ${batch_size} \
  --learning_rate 2e-5 \
  --num_train_epochs 6 \
  --output_dir $output_dir \
  --overwrite_output_dir \
  --logging_steps 1000 \
  --logging_dir $output_dir
```
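After fine-tuning, the checkpoint written to `output_dir` can be used for zero-shot cross-lingual inference, i.e., classifying premise/hypothesis pairs in languages that were never seen during fine-tuning. The snippet below is a minimal sketch under the assumption that the run above saved a usable checkpoint to `ds_results`; the label order shown is the usual XNLI convention (entailment, neutral, contradiction) and should be double-checked against `model.config.id2label`.

```python
# Minimal inference sketch (assumes the fine-tuning run above produced ./ds_results).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "ds_results"  # output_dir from the fine-tuning script above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# A French premise/hypothesis pair, even though fine-tuning used English data only
# (zero-shot cross-lingual transfer).
premise = "Le modèle a été entraîné uniquement sur des données en anglais."
hypothesis = "Le modèle a vu des données d'entraînement en anglais."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Typical XNLI label order; verify against model.config.id2label.
labels = ["entailment", "neutral", "contradiction"]
print(labels[logits.argmax(dim=-1).item()])
```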
Documentation
More technical details about the model, its pre-training, and the fine-tuning results above are available in the DeBERTaV3 paper and the official repository.
License
This project is licensed under the MIT license.
Citation
If you find DeBERTa useful for your work, please cite the following papers:
```bibtex
@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{he2021deberta,
      title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
      author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
      booktitle={International Conference on Learning Representations},
      year={2021},
      url={https://openreview.net/forum?id=XPZIaotutsD}
}
```