🚀 INT8 DistilBERT Base Uncased Fine-Tuned on SQuAD
This model is an INT8 quantized version of DistilBERT base uncased, fine-tuned on the Stanford Question Answering Dataset (SQuAD). Quantization was performed with Hugging Face's Optimum Intel and Intel® Neural Compressor, aiming to reduce model size and speed up inference while maintaining accuracy.
✨ Features
- Quantization: Converted from FP32 to INT8 using post-training static quantization.
- Optimized for QA: Designed for question-answering tasks with fast inference and reduced model size.
- Multi-framework Support: Available in both PyTorch and ONNX versions.
📦 Installation
The original model card does not provide specific installation steps.
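The usage examples below assume Optimum is installed with the relevant extras, e.g. `pip install "optimum[neural-compressor]"` for the PyTorch / Neural Compressor path and `pip install "optimum[onnxruntime]"` for the ONNX path; these package extras are taken from the Optimum documentation rather than from the original card.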
💻 Usage Examples
Basic Usage
```python
from optimum.intel import INCModelForQuestionAnswering

model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
int8_model = INCModelForQuestionAnswering.from_pretrained(model_id)
```
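A minimal end-to-end sketch of running the quantized PyTorch model follows. It assumes the tokenizer files are available in the same repository (otherwise load the base model's tokenizer) and that the model returns the standard question-answering start/end logits; the question and context strings are illustrative.

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import INCModelForQuestionAnswering

model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
int8_model = INCModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

question = "What was the model fine-tuned on?"
context = "The model was fine-tuned on the Stanford Question Answering Dataset (SQuAD)."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = int8_model(**inputs)

# Take the highest-scoring start/end positions and decode the answer span.
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits)) + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end], skip_special_tokens=True)
print(answer)
```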
Advanced Usage
```python
from optimum.onnxruntime import ORTModelForQuestionAnswering

model = ORTModelForQuestionAnswering.from_pretrained("Intel/distilbert-base-uncased-distilled-squad-int8-static")
```
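The ONNX model can also be dropped into the transformers `pipeline` API. A short sketch, again assuming the tokenizer can be loaded from the same repository:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
model = ORTModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a standard question-answering pipeline backed by the ONNX Runtime model.
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa(
    question="Which dataset was the model fine-tuned on?",
    context="The model was fine-tuned on the Stanford Question Answering Dataset (SQuAD).",
)
print(result["answer"], result["score"])
```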
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Authors | Xin He, Zixuan Cheng, Yu Wenz |
| Date | Aug 4, 2022 |
| Version | The base model for this quantization process was distilbert-base-uncased-distilled-squad, a distilled version of BERT designed for the question-answering task. |
| Model Type | Language Model |
| Paper or Other Resources | Base Model: [distilbert-base-uncased-distilled-squad](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad) |
| License | apache-2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/distilbert-base-uncased-distilled-squad-int8-static-inc/discussions) and Intel DevHub Discord |
| Quantization Details | The model underwent post-training static quantization to convert it from its original FP32 precision to INT8, optimizing for size and inference speed while aiming to retain as much of the original model's accuracy as possible. |
| Calibration Details | For PyTorch, calibration used the train dataloader with an effective sampling size of 304, because the default calibration sampling size of 300 is not exactly divisible by the batch size of 8. For the ONNX version, calibration used the eval dataloader with the default calibration sampling size of 100. |
Intended Use
| Property | Details |
|----------|---------|
| Primary Intended Uses | This model is intended for question-answering tasks, where it provides answers to questions given a context passage. It is optimized for scenarios requiring fast inference and reduced model size without significantly compromising accuracy. |
| Primary Intended Users | Researchers, developers, and enterprises that require efficient, low-latency question-answering capabilities in their applications, particularly where computational resources are limited. |
| Out-of-scope Uses | |
Evaluation
PyTorch Version
This is an INT8 PyTorch model quantized with [huggingface/optimum-intel](https://github.com/huggingface/optimum-intel) through the use of [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
| | INT8 | FP32 |
|---|---|---|
| Accuracy (eval-f1) | 86.1069 | 86.8374 |
| Model size (MB) | 74.7 | 265 |
ONNX Version
This is an INT8 ONNX model quantized with [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
| | INT8 | FP32 |
|---|---|---|
| Accuracy (eval-f1) | 0.8633 | 0.8687 |
| Model size (MB) | 154 | 254 |
🔧 Technical Details
The model was quantized using post-training static quantization. For PyTorch, specific calibration dataloader settings were needed because the default calibration sampling size is not evenly divisible by the batch size; the ONNX version used a different calibration dataloader with the default sampling size.
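The card does not include the exact quantization script. The following is a minimal sketch of how such a post-training static quantization run can be set up with the optimum-intel `INCQuantizer` API; the preprocessing function, maximum sequence length, and output directory are illustrative assumptions, and the current API may differ from the version used when this model was produced in 2022.

```python
from functools import partial

from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig

base_model_id = "distilbert/distilbert-base-uncased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def preprocess(examples, tokenizer):
    # Tokenize question/context pairs for calibration; max_length is illustrative.
    return tokenizer(
        examples["question"],
        examples["context"],
        padding="max_length",
        max_length=384,
        truncation=True,
    )

quantizer = INCQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "squad",
    preprocess_function=partial(preprocess, tokenizer=tokenizer),
    num_samples=300,  # default calibration sampling size; the card reports 304 effective samples
    dataset_split="train",
)
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=calibration_dataset,
    save_directory="distilbert-squad-int8-static",
)
```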
📄 License
This model is licensed under the Apache 2.0 license.
⚠️ Important Note
Users should be aware of potential biases present in the training data (SQuAD and Wikipedia), and consider the implications of these biases on the model's outputs. Additionally, quantization may introduce or exacerbate biases in certain scenarios.
💡 Usage Tip
- Users should consider the balance between performance and accuracy when deploying quantized models in critical applications.
- Further fine-tuning or calibration may be necessary for specific use cases or to meet stricter accuracy requirements.