🚀 Model Details: QuaLA-MiniLM
QuaLA-MiniLM is built with a novel approach that combines knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization. It extends the Dynamic-TinyBERT approach by training a single model that can adapt to any inference scenario within a given computational budget, yielding a superior accuracy-efficiency trade-off on the SQuAD1.1 dataset. The authors compare their method with other efficiency techniques and report up to an 8.8x speedup with less than 1% accuracy loss. The code for this model is publicly available on GitHub. The paper also reviews related work, including dynamic transformers and various knowledge distillation approaches.
✨ Features
- Combines knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization (see the quantization sketch after this list).
- Adapts to different inference scenarios with a given computational budget.
- Achieves a high accuracy-efficiency trade-off on the SQuAD1.1 dataset.
- Up to an 8.8x speedup with less than 1% accuracy loss.
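The 8-bit quantization step can be illustrated with PyTorch's dynamic INT8 quantization. This is a minimal sketch, not the authors' toolchain (which this card does not specify), and the MiniLMv2 checkpoint name is only a plausible example of a MiniLM distilled from a RoBERTa-Large teacher:

```python
import torch
from transformers import AutoModel

# Plausible MiniLMv2 checkpoint distilled from a RoBERTa-Large teacher;
# substitute the actual base model used for QuaLA-MiniLM.
model = AutoModel.from_pretrained(
    "nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large"
)

# Dynamic quantization: weights of all Linear layers are stored as 8-bit
# integers; activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

The quantized model is a drop-in replacement for the float model at inference time, storing Linear-layer weights at roughly a quarter of their float32 size.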
📦 Installation
The original model card does not list installation steps. The usage example below assumes PyTorch and the Hugging Face `transformers` library are installed (e.g., via `pip install torch transformers`).
💻 Usage Examples
Basic Usage
The original card leaves this section as a stub. Below is a minimal sketch of querying the model through the Hugging Face `transformers` question-answering pipeline; `"Intel/quala-minilm"` is a placeholder identifier, not a confirmed Hub path.
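```python
from transformers import pipeline

# "Intel/quala-minilm" is a hypothetical model identifier -- replace it with
# the actual path of the published checkpoint.
qa = pipeline("question-answering", model="Intel/quala-minilm")

result = qa(
    question="What techniques does QuaLA-MiniLM combine?",
    context=(
        "QuaLA-MiniLM combines knowledge distillation, the length-adaptive "
        "transformer (LAT) technique, and 8-bit quantization."
    ),
)
print(result["answer"], result["score"])
```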
📚 Documentation
QuaLA-MiniLM training process
To run the model with the best accuracy-efficiency trade-off for a specific computational budget, we set the length configuration (the number of tokens kept at each layer) to the best setting found by an evolutionary search under that computational constraint. A minimal sketch of this final selection step follows.
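The sketch below is hypothetical and uses the two QuaLA-MiniLM operating points from the metrics table as the candidate pool; the actual evolutionary search evaluates many more length configurations:

```python
# Each candidate pairs a per-layer token budget with its measured
# (F1, latency in ms); values are the two QuaLA-MiniLM rows below.
candidates = [
    ((384, 384, 384, 384, 384, 384), 88.8593, 7.4443),
    ((315, 251, 242, 159, 142, 33), 87.6828, 6.4146),
]

def best_config(candidates, latency_budget_ms):
    """Return the most accurate (config, f1, latency) within the budget."""
    feasible = [c for c in candidates if c[2] <= latency_budget_ms]
    return max(feasible, key=lambda c: c[1]) if feasible else None

# Under a 7 ms budget, only the token-dropping configuration qualifies.
print(best_config(candidates, latency_budget_ms=7.0))
```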
| Property | Details |
|----------|---------|
| Language | en |
| Model Authors Company | Intel |
| Date | May 4, 2023 |
| Version | 1 |
| Model Type | NLP - Tiny language model |
| Architecture | "In this work we expand Dynamic-TinyBERT to generate a much more highly efficient model. First, we use a much smaller MiniLM model which was distilled from a RoBERTa-Large teacher rather than BERT-base. Second, we apply the LAT method to make the model length-adaptive, and finally we further enhance the model's efficiency by applying 8-bit quantization. The resultant QuaLA-MiniLM (Quantized Length-Adaptive MiniLM) model outperforms BERT-base with only 30% of parameters, and demonstrates an accuracy-speedup tradeoff that is superior to any other efficiency approach (up to x8.8 speedup with <1% accuracy loss) on the challenging SQuAD1.1 benchmark. Following the concept presented by LAT, it provides a wide range of accuracy-efficiency tradeoff points while alleviating the need to retrain it for each point along the accuracy-efficiency curve." |
| Paper or Other Resources | https://arxiv.org/pdf/2210.17114.pdf |
| License | TBD |
| Questions or Comments | Intel DevHub Discord |
| Intended Use | Details |
|--------------|---------|
| Primary intended uses | TBD |
| Primary intended users | Anyone who needs an efficient tiny language model for other downstream tasks. |
| Out-of-scope uses | The model should not be used to intentionally create hostile or alienating environments for people. |
Metrics (Model Performance)
Inference performance on the SQuAD1.1 evaluation dataset. For all the length-adaptive (LA) models, we show the performance both of running the model without token-dropping, and of running the model in a token-dropping configuration according to the optimal length configuration found to meet our accuracy constraint.
| Model | Model size (MB) | Tokens per layer | Accuracy (F1) | Latency (ms) | FLOPs | Speedup |
|-------|-----------------|------------------|---------------|--------------|-------|---------|
| BERT-base | 415.4723 | (384,384,384,384,384,384) | 88.5831 | 56.5679 | 3.53E+10 | 1x |
| TinyBERT-ours | 253.2077 | (384,384,384,384,384,384) | 88.3959 | 32.4038 | 1.77E+10 | 1.74x |
| QuaTinyBERT-ours | 132.0665 | (384,384,384,384,384,384) | 87.6755 | 15.5850 | 1.77E+10 | 3.63x |
| MiniLMv2-ours | 115.0473 | (384,384,384,384,384,384) | 88.7016 | 18.2312 | 4.76E+09 | 3.10x |
| QuaMiniLMv2-ours | 84.8602 | (384,384,384,384,384,384) | 88.5463 | 9.1466 | 4.76E+09 | 6.18x |
| LA-MiniLM | 115.0473 | (384,384,384,384,384,384) | 89.2811 | 16.9900 | 4.76E+09 | 3.33x |
| LA-MiniLM | 115.0473 | (269, 253, 252, 202, 104, 34) | 87.7637 | 11.4428 | 2.49E+09 | 4.94x |
| QuaLA-MiniLM | 84.8596 | (384,384,384,384,384,384) | 88.8593 | 7.4443 | 4.76E+09 | 7.6x |
| QuaLA-MiniLM | 84.8596 | (315,251,242,159,142,33) | 87.6828 | 6.4146 | 2.547E+09 | 8.8x |
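As a quick sanity check on the Speedup column, each entry is the BERT-base latency divided by the model's latency:

```python
# Speedup = baseline latency / model latency, with BERT-base as the 1x baseline.
bert_base_ms = 56.5679
quala_minilm_ms = 6.4146  # QuaLA-MiniLM, length config (315,251,242,159,142,33)

print(f"{bert_base_ms / quala_minilm_ms:.1f}x")  # -> 8.8x
```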
Training and Evaluation Data
| Property | Details |
|----------|---------|
| Datasets | SQuAD1.1 dataset |
| Motivation | To build an efficient and accurate base model for several downstream language tasks. |
Ethical Considerations
| Ethical Considerations | Details |
|------------------------|---------|
| Data | SQuAD1.1 dataset |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of labelled Wikipedia articles. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al., 2021, and Bender et al., 2021). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Beyond this, the extent of the risks involved in using the model remains unknown. |
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. There are no additional caveats or recommendations for this model.
BibTeX entry and citation info
| Comments | Description |
|----------|-------------|
| Comments | In this version we added reference to the source code in the abstract. arXiv admin note: text overlap with arXiv:2111.09645 |
| Subjects | Computation and Language (cs.CL) |
| Cite as | arXiv:2210.17114 [cs.CL] (or arXiv:2210.17114v2 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2210.17114 |
📄 License
The model is released under the MIT license.