🚀 MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
MiniLM is a distilled model designed to offer efficient solutions for language understanding and generation tasks. It was introduced in the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers".
For comprehensive details regarding preprocessing, training, and other aspects of MiniLM, please refer to the original MiniLM repository.
⚠️ Important Note
This checkpoint uses `BertModel` with `XLMRobertaTokenizer`, so `AutoTokenizer` is not compatible with this checkpoint!
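A minimal loading sketch follows, assuming PyTorch and the huggingface/transformers library are installed; the tokenizer is loaded from `xlm-roberta-base` (MiniLM uses the XLM-R tokenizer, see below) and the example sentence is purely illustrative:

```python
import torch
from transformers import BertModel, XLMRobertaTokenizer

# Load the XLM-R tokenizer explicitly; AutoTokenizer does not work for this checkpoint.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
# The weights follow the BERT architecture, so BertModel is used to load them.
model = BertModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

# Encode an illustrative sentence and inspect the hidden states.
inputs = tokenizer("MiniLM is small and fast.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 384])
```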
✨ Features
Multilingual Pretrained Model
- Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters
Multilingual MiniLM employs the same tokenizer as XLM-R, but its Transformer architecture is similar to that of BERT. We provide fine-tuning code for XNLI based on huggingface/transformers; to fine-tune multilingual MiniLM, replace `run_xnli.py` in transformers with ours.
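The parameter split quoted above can be sanity-checked directly from the checkpoint. The sketch below (assuming the loading setup from the note above) simply sums the parameter counts of the embedding and encoder sub-modules of `BertModel`:

```python
from transformers import BertModel

model = BertModel.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

def count_parameters(module):
    # Total number of parameters in a sub-module.
    return sum(p.numel() for p in module.parameters())

# Embedding table dominates: ~250k XLM-R vocabulary entries x 384 dimensions (~96M).
embedding_params = count_parameters(model.embeddings)
# 12 Transformer layers with hidden size 384 (~21M).
transformer_params = count_parameters(model.encoder)
print(f"embeddings: {embedding_params / 1e6:.1f}M, transformer: {transformer_params / 1e6:.1f}M")
```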
We evaluate the multilingual MiniLM on two benchmarks: the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).
Cross-Lingual Natural Language Inference - XNLI
We assess our model's cross-lingual transfer capabilities from English to other languages. Following Conneau et al. (2019), we select the best single model on the joint dev set of all languages.
| Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
| XLM-100 | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
| XLM-R Base | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
| mMiniLM-L12xH384 | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
💻 Usage Examples
Basic Usage
This example code demonstrates how to fine-tune the 12-layer multilingual MiniLM on XNLI.
```bash
DATA_DIR=/{path_of_data}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_PATH=/{path_of_pre-trained_model}/
python ./examples/run_xnli.py --model_type minilm \
 --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
 --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
 --tokenizer_name xlm-roberta-base \
 --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
 --do_train \
 --do_eval \
 --max_seq_length 128 \
 --per_gpu_train_batch_size 128 \
 --learning_rate 5e-5 \
 --num_train_epochs 5 \
 --per_gpu_eval_batch_size 32 \
 --weight_decay 0.001 \
 --warmup_steps 500 \
 --save_steps 1500 \
 --logging_steps 1500 \
 --eval_all_checkpoints \
 --language en \
 --fp16 \
 --fp16_opt_level O2
```
Cross-Lingual Question Answering - MLQA
Following Lewis et al. (2019b), we use SQuAD 1.1 as training data and MLQA English development data for early stopping.
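The MLQA fine-tuning script is not included in this card. As a hedged starting point, the sketch below wraps the checkpoint in an extractive-QA head with the same `BertModel`/`XLMRobertaTokenizer` pairing as above; the span-prediction head is randomly initialized and only becomes meaningful after fine-tuning on SQuAD 1.1, and the question/context strings are purely illustrative:

```python
import torch
from transformers import BertForQuestionAnswering, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
# The qa_outputs span-prediction head is newly initialized here and must be trained on SQuAD 1.1.
model = BertForQuestionAnswering.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")

question = "Where is the Eiffel Tower located?"            # illustrative only
context = "The Eiffel Tower is located in Paris, France."  # illustrative only
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)

with torch.no_grad():
    outputs = model(**inputs)
# Before fine-tuning these logits are meaningless; after SQuAD training they give the answer span.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```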
| Model (F1) | #Layers | #Hidden | #Transformer Parameters | Average | en | es | de | ar | hi | vi | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 12 | 768 | 85M | 57.7 | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
| XLM-15 | 12 | 1024 | 151M | 61.6 | 74.9 | 68.0 | 62.2 | 54.8 | 48.8 | 61.4 | 61.1 |
| XLM-R Base (Reported) | 12 | 768 | 85M | 62.9 | 77.8 | 67.2 | 60.8 | 53.0 | 57.9 | 63.1 | 60.2 |
| XLM-R Base (Our fine-tuned) | 12 | 768 | 85M | 64.9 | 80.3 | 67.0 | 62.7 | 55.0 | 60.4 | 66.5 | 62.3 |
| mMiniLM-L12xH384 | 12 | 384 | 21M | 63.2 | 79.4 | 66.1 | 61.2 | 54.9 | 58.5 | 63.1 | 59.0 |
📄 License
This project is licensed under the MIT license.
📚 Documentation
Citation
If you find MiniLM useful in your research, please cite the following paper:
```bibtex
@misc{wang2020minilm,
  title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
  author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
  year={2020},
  eprint={2002.10957},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```