🚀 BERT-base uncased model fine-tuned on SQuAD v1
This is a BERT-base uncased model fine-tuned on the SQuAD v1 dataset. It uses pruning techniques to reduce the number of weights and speed up inference while maintaining high accuracy.
✨ Features
- Pruned Weights: The linear layers contain 27.0% of the original weights, and the model contains 43.0% of the original weights overall.
- Faster Inference: With a simple resizing of the linear matrices, it runs 1.96x as fast as bert-base-uncased during evaluation.
- High Accuracy: Its F1 score is 88.33, with only a 0.17 drop compared to bert-base-uncased.
📦 Installation
Install nn_pruning: it contains the optimization script, which simply packs the linear layers into smaller ones by removing empty rows/columns.

```bash
pip install nn_pruning
```
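Conceptually, that packing step keeps only the non-empty rows and columns of each pruned linear layer. The snippet below is a minimal, hypothetical sketch of the idea for a single layer, not the actual nn_pruning implementation (which also keeps the surrounding layers consistent):

```python
import torch.nn as nn


def pack_linear(layer: nn.Linear) -> nn.Linear:
    """Hypothetical sketch: shrink a pruned nn.Linear by dropping all-zero rows/columns."""
    weight = layer.weight.data
    keep_rows = weight.abs().sum(dim=1) != 0  # output features that still carry weights
    keep_cols = weight.abs().sum(dim=0) != 0  # input features that still carry weights

    packed = nn.Linear(int(keep_cols.sum()), int(keep_rows.sum()), bias=layer.bias is not None)
    packed.weight.data = weight[keep_rows][:, keep_cols]
    if layer.bias is not None:
        packed.bias.data = layer.bias.data[keep_rows]
    return packed
```

Removing columns changes the layer's expected input size, so in a real network the neighbouring layers have to be shrunk consistently, which is what optimize_model takes care of.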
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Load the pruned checkpoint as a standard question-answering pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
)

print("bert-base-uncased parameters: 191.0M")
print(f"Parameters count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# Pack the sparse linear layers into smaller dense ones to get the full speedup
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameters count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
print("Predictions", predictions)
```
📚 Documentation
Fine-Pruning details
This model was fine-tuned from the HuggingFace model checkpoint on SQuAD1.1, and distilled from the model bert-large-uncased-whole-word-masking-finetuned-squad. This model is case-insensitive: it does not make a difference between english and English.
A side effect of the block pruning is that some of the attention heads are completely removed: 55 heads were removed out of a total of 144 (38.2%). The distribution of the remaining heads across layers can be inspected directly on the optimized model, as in the sketch below.
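A minimal sketch for counting the remaining heads per layer, assuming the per-head dimension stays at 64 as in standard BERT-base and that the patched attention projections still expose ordinary nn.Linear modules after optimize_model:

```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
)
qa.model = optimize_model(qa.model, "dense")

HEAD_SIZE = 64  # assumption: BERT-base head size, unchanged by pruning
for i, layer in enumerate(qa.model.bert.encoder.layer):
    # After optimization, the query projection's output size is (remaining heads) * HEAD_SIZE
    remaining = layer.attention.self.query.out_features // HEAD_SIZE
    print(f"layer {i:2d}: {remaining} heads remaining")
```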
Details of the SQuAD1.1 dataset
| Property | Details |
| --- | --- |
| Dataset | SQuAD1.1 |
| Train Split Samples | 90.6K |
| Eval Split Samples | 11.1K |
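For reference, a minimal sketch for loading the SQuAD v1.1 splits with the datasets library (raw example counts can differ slightly from the sample counts reported above, which presumably reflect preprocessing):

```python
from datasets import load_dataset

squad = load_dataset("squad")  # SQuAD v1.1
print({split: len(ds) for split, ds in squad.items()})
# roughly 87.6K train / 10.6K validation raw examples
```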
Fine-tuning
- Python Version: 3.8.5
- Machine Specs:
  - CPU: Intel(R) Core(TM) i7-6700K
  - Memory: 64 GiB
  - GPUs: 1 GeForce RTX 3090 with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
| Metric | Value | Original BERT-base (Table 2 of the BERT paper) | Variation |
| --- | --- | --- | --- |
| EM | 81.31 | 80.8 | +0.51 |
| F1 | 88.33 | 88.5 | -0.17 |
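A sketch of how these numbers could be checked with the evaluate library's squad metric, running the pipeline over the validation split (only a small slice is used here for speed; a full pass over the validation set is needed to reproduce the table):

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
)
qa.model = optimize_model(qa.model, "dense")

validation = load_dataset("squad", split="validation[:100]")  # small slice for a quick check
metric = evaluate.load("squad")

predictions, references = [], []
for example in validation:
    answer = qa(question=example["question"], context=example["context"])
    predictions.append({"id": example["id"], "prediction_text": answer["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

print(metric.compute(predictions=predictions, references=references))
# returns 'exact_match' and 'f1'
```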
Model File Size
PyTorch model file size: 374 MB (original BERT: 420 MB).
🔧 Technical Details
This model was created using the nn_pruning python library. It uses ReLUs instead of the GeLUs of the original BERT network to speed up inference. This does not need special handling, as it is supported by the Transformers library and flagged in the model config by the "hidden_act": "relu" entry.
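A quick way to confirm this from the published config, using only the Transformers API:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1"
)
print(config.hidden_act)  # expected: "relu" (vs. "gelu" for vanilla bert-base-uncased)
```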
This model CANNOT be used without the nn_pruning optimize_model function, as it uses NoNorms instead of LayerNorms, which is not currently supported by the Transformers library.
📄 License
This model is released under the MIT license.