🚀 Model Details: QuaLA-MiniLM
QuaLA-MiniLM is built with a novel approach that combines knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization. It extends the Dynamic-TinyBERT approach by training a single model that can adapt to any inference scenario within a given computational budget, yielding a superior accuracy-efficiency trade-off on the SQuAD1.1 dataset. The authors compare their method with other efficiency techniques and report up to an 8.8x speedup with less than 1% accuracy loss. The code for this model is publicly available on GitHub. The paper also reviews related work, including dynamic transformers and various knowledge distillation approaches.
✨ Features
- Combines knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization (see the quantization sketch after this list).
- Adapts to different inference scenarios with a given computational budget.
- Achieves a high accuracy-efficiency trade-off on the SQuAD1.1 dataset.
- Up to an 8.8x speedup with less than 1% accuracy loss.
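The 8-bit quantization step can be illustrated with PyTorch's dynamic INT8 quantization. This is a minimal sketch, not the authors' toolchain (which this card does not specify), and the MiniLMv2 checkpoint name is only a plausible example of a MiniLM distilled from a RoBERTa-Large teacher:

```python
import torch
from transformers import AutoModel

# Plausible MiniLMv2 checkpoint distilled from a RoBERTa-Large teacher;
# substitute the actual base model used for QuaLA-MiniLM.
model = AutoModel.from_pretrained(
    "nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large"
)

# Dynamic quantization: weights of all Linear layers are stored as 8-bit
# integers; activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

The quantized model is a drop-in replacement for the float model at inference time, storing Linear-layer weights at roughly a quarter of their float32 size.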
📦 Installation
The original model card does not list installation steps. The usage example below assumes PyTorch and the Hugging Face `transformers` library are installed (e.g., via `pip install torch transformers`).
💻 Usage Examples
Basic Usage
The original card leaves this section as a stub. Below is a minimal sketch of querying the model through the Hugging Face `transformers` question-answering pipeline; `"Intel/quala-minilm"` is a placeholder identifier, not a confirmed Hub path.
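```python
from transformers import pipeline

# "Intel/quala-minilm" is a hypothetical model identifier -- replace it with
# the actual path of the published checkpoint.
qa = pipeline("question-answering", model="Intel/quala-minilm")

result = qa(
    question="What techniques does QuaLA-MiniLM combine?",
    context=(
        "QuaLA-MiniLM combines knowledge distillation, the length-adaptive "
        "transformer (LAT) technique, and 8-bit quantization."
    ),
)
print(result["answer"], result["score"])
```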
📚 Documentation
QuaLA-MiniLM training process
To run the model with the best accuracy-efficiency trade-off for a specific computational budget, we set the length configuration (the number of tokens kept at each layer) to the best setting found by an evolutionary search under that computational constraint. A minimal sketch of this final selection step follows.
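The sketch below is hypothetical and uses the two QuaLA-MiniLM operating points from the metrics table as the candidate pool; the actual evolutionary search evaluates many more length configurations:

```python
# Each candidate pairs a per-layer token budget with its measured
# (F1, latency in ms); values are the two QuaLA-MiniLM rows below.
candidates = [
    ((384, 384, 384, 384, 384, 384), 88.8593, 7.4443),
    ((315, 251, 242, 159, 142, 33), 87.6828, 6.4146),
]

def best_config(candidates, latency_budget_ms):
    """Return the most accurate (config, f1, latency) within the budget."""
    feasible = [c for c in candidates if c[2] <= latency_budget_ms]
    return max(feasible, key=lambda c: c[1]) if feasible else None

# Under a 7 ms budget, only the token-dropping configuration qualifies.
print(best_config(candidates, latency_budget_ms=7.0))
```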
| Property | Details |
|----------|---------|
| Language | en |
| Model Authors Company | Intel |
| Date | May 4, 2023 |
| Version | 1 |
| Model Type | NLP - Tiny language model |
| Architecture | "In this work we expand Dynamic-TinyBERT to generate a much more highly efficient model. First, we use a much smaller MiniLM model which was distilled from a RoBERTa-Large teacher rather than BERT-base. Second, we apply the LAT method to make the model length-adaptive, and finally we further enhance the model's efficiency by applying 8-bit quantization. The resultant QuaLA-MiniLM (Quantized Length-Adaptive MiniLM) model outperforms BERT-base with only 30% of parameters, and demonstrates an accuracy-speedup tradeoff that is superior to any other efficiency approach (up to x8.8 speedup with <1% accuracy loss) on the challenging SQuAD1.1 benchmark. Following the concept presented by LAT, it provides a wide range of accuracy-efficiency tradeoff points while alleviating the need to retrain it for each point along the accuracy-efficiency curve." |
| Paper or Other Resources | https://arxiv.org/pdf/2210.17114.pdf |
| License | TBD |
| Questions or Comments | Intel DevHub Discord |
| Intended Use | Details |
|--------------|---------|
| Primary intended uses | TBD |
| Primary intended users | Anyone who needs an efficient tiny language model for other downstream tasks. |
| Out-of-scope uses | The model should not be used to intentionally create hostile or alienating environments for people. |
Metrics (Model Performance)
Inference performance on the SQuAD1.1 evaluation dataset. For all the length-adaptive (LA) models, we show the performance both of running the model without token-dropping, and of running the model in a token-dropping configuration according to the optimal length configuration found to meet our accuracy constraint.
| Model | Model size (MB) | Tokens per layer | Accuracy (F1) | Latency (ms) | FLOPs | Speedup |
|-------|-----------------|------------------|---------------|--------------|-------|---------|
| BERT-base | 415.4723 | (384,384,384,384,384,384) | 88.5831 | 56.5679 | 3.53E+10 | 1x |
| TinyBERT-ours | 253.2077 | (384,384,384,384,384,384) | 88.3959 | 32.4038 | 1.77E+10 | 1.74x |
| QuaTinyBERT-ours | 132.0665 | (384,384,384,384,384,384) | 87.6755 | 15.5850 | 1.77E+10 | 3.63x |
| MiniLMv2-ours | 115.0473 | (384,384,384,384,384,384) | 88.7016 | 18.2312 | 4.76E+09 | 3.10x |
| QuaMiniLMv2-ours | 84.8602 | (384,384,384,384,384,384) | 88.5463 | 9.1466 | 4.76E+09 | 6.18x |
| LA-MiniLM | 115.0473 | (384,384,384,384,384,384) | 89.2811 | 16.9900 | 4.76E+09 | 3.33x |
| LA-MiniLM | 115.0473 | (269, 253, 252, 202, 104, 34) | 87.7637 | 11.4428 | 2.49E+09 | 4.94x |
| QuaLA-MiniLM | 84.8596 | (384,384,384,384,384,384) | 88.8593 | 7.4443 | 4.76E+09 | 7.6x |
| QuaLA-MiniLM | 84.8596 | (315,251,242,159,142,33) | 87.6828 | 6.4146 | 2.547E+09 | 8.8x |
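As a quick sanity check on the Speedup column, each entry is the BERT-base latency divided by the model's latency:

```python
# Speedup = baseline latency / model latency, with BERT-base as the 1x baseline.
bert_base_ms = 56.5679
quala_minilm_ms = 6.4146  # QuaLA-MiniLM, length config (315,251,242,159,142,33)

print(f"{bert_base_ms / quala_minilm_ms:.1f}x")  # -> 8.8x
```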
Training and Evaluation Data
| Property | Details |
|----------|---------|
| Datasets | SQuAD1.1 dataset |
| Motivation | To build an efficient and accurate base model for several downstream language tasks. |
Ethical Considerations
| Ethical Considerations | Details |
|------------------------|---------|
| Data | SQuAD1.1 dataset |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of labelled Wikipedia articles. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al., 2021, and Bender et al., 2021). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Beyond this, the extent of the risks involved in using the model remains unknown. |
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. There are no additional caveats or recommendations for this model.
BibTeX entry and citation info
| Comments | Description |
|----------|-------------|
| Comments | In this version we added reference to the source code in the abstract. arXiv admin note: text overlap with arXiv:2111.09645 |
| Subjects | Computation and Language (cs.CL) |
| Cite as | arXiv:2210.17114 [cs.CL] (or arXiv:2210.17114v2 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2210.17114 |
📄 License
The model is released under the MIT license.