🚀 BERT-base uncased model fine-tuned on SQuAD v1
This is a BERT-base uncased model fine-tuned on the SQuAD v1 dataset. It leverages block pruning to reduce the model size and speed up inference while maintaining high accuracy.
🚀 Quick Start
To start using this model, first install the `nn_pruning` library: it contains the optimization script that packs the linear layers into smaller ones by removing empty rows and columns.
```bash
pip install nn_pruning
```
Then you can use the `transformers` library almost as usual; you just have to call `optimize_model` once the pipeline has loaded.
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1"
)

# For reference: bert-base-uncased has roughly 110M parameters.
print("bert-base-uncased parameters: ~110M")
print(f"Parameters count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# Pack the sparse linear layers into smaller dense ones to get the full speedup.
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameters count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
print("Predictions", predictions)
```
✨ Features
- Pruning: The linear layers of the model contain 30.0% of the original weights, and the model as a whole retains 45.0% of the original weights. This significantly reduces the model size and speeds up inference (see the density-check sketch after this list).
- Activation Function: ReLU activations replace the original GELUs to speed up inference; this is supported natively by the Transformers library.
- Accuracy: The model achieves an F1 score of 89.19, a gain of 0.69 over the original bert-base-uncased model.
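As a sanity check on the 30% figure, here is a minimal sketch (not part of the original card) that counts non-zero weights in the encoder's 2-D weight matrices. Which matrices to count is an assumption, and since the checkpoint already has attention heads physically removed, the measured ratio will only approximate the quoted figure.

```python
# Minimal density check (illustrative sketch; see the assumptions above).
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1"
)

total, nonzero = 0, 0
for name, param in model.named_parameters():
    # Count only the encoder's linear-layer weight matrices (2-D tensors).
    if "encoder" in name and param.dim() == 2:
        total += param.numel()
        nonzero += int((param != 0).sum())
print(f"Non-zero fraction of encoder linear weights: {nonzero / total:.1%}")
```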
📦 Installation
Install the `nn_pruning` library using the following command:

```bash
pip install nn_pruning
```
💻 Usage Examples
See the Quick Start above for the complete basic-usage example: load the question-answering pipeline, call `optimize_model` to pack the pruned linear layers, and run a prediction.
📚 Documentation
Fine-Pruning details
This model was fine-tuned from the HuggingFace bert-base-uncased checkpoint on SQuAD1.1, and distilled from bert-large-uncased-whole-word-masking-finetuned-squad. It is case-insensitive.
A side effect of block pruning is that some attention heads are removed entirely: 55 out of a total of 144 (38.2%). (The original model card includes an interactive plot showing how the remaining heads are distributed across the network.)
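To see where the remaining heads sit, one option (a hedged sketch, assuming the checkpoint records removed heads through the standard Transformers `prune_heads` mechanism) is to inspect each layer's self-attention module:

```python
# Hedged sketch: count the attention heads remaining in each encoder layer.
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1"
)
for i, layer in enumerate(model.bert.encoder.layer):
    print(f"layer {i:2d}: {layer.attention.self.num_attention_heads} heads")
```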
Details of the SQuAD1.1 dataset
| Dataset  | Split | # samples |
| -------- | ----- | --------- |
| SQuAD1.1 | train | 90.6K     |
| SQuAD1.1 | eval  | 11.1K     |
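For comparison, here is a hedged sketch that prints the raw split sizes with the 🤗 `datasets` library. The raw dataset has 87,599 train and 10,570 validation examples, so the counts above presumably refer to preprocessed training features rather than raw examples.

```python
# Hedged sketch: print the raw SQuAD1.1 split sizes with the datasets library.
from datasets import load_dataset

squad = load_dataset("squad")
print({split: len(ds) for split, ds in squad.items()})
# -> {'train': 87599, 'validation': 10570}
```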
Fine-tuning
- Python: `3.8.5`
- Machine specs:
  - Memory: 64 GiB
  - GPUs: 1 GeForce RTX 3090, with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
PyTorch model file size: `374MB` (original BERT: `420MB`)

| Metric | Value | Original (Table 2) | Variation |
| ------ | ----- | ------------------ | --------- |
| EM     | 82.21 | 80.8               | +1.41     |
| F1     | 89.19 | 88.5               | +0.69     |
🔧 Technical Details
This model was created using the `nn_pruning` Python library. It uses NoNorms instead of LayerNorms (a NoNorm applies only an element-wise scale and shift, skipping the mean/variance normalization), which is why the `optimize_model` function from the `nn_pruning` library must be applied after loading. The pruning method leads to structured matrices, which lets the packed model run 2.01x as fast as bert-base-uncased on the evaluation.
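The 2.01x figure comes from the full SQuAD evaluation; a quick single-example timing like the hedged sketch below will not reproduce it exactly, but it illustrates the effect of `optimize_model`:

```python
# Rough timing sketch (illustrative, not the card's benchmark): compare
# single-example latency before and after optimize_model packs the layers.
import time
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1",
)
example = {
    "question": "Who is Frederic Chopin?",
    "context": "Frédéric François Chopin (1810-1849) was a Polish composer "
               "and virtuoso pianist of the Romantic era.",
}

def mean_latency(p, n=20):
    p(example)  # warm-up run
    start = time.perf_counter()
    for _ in range(n):
        p(example)
    return (time.perf_counter() - start) / n

before = mean_latency(qa)
qa.model = optimize_model(qa.model, "dense")
after = mean_latency(qa)
print(f"before: {before * 1e3:.1f} ms  after: {after * 1e3:.1f} ms  "
      f"speedup: {before / after:.2f}x")
```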
📄 License
This model is released under the MIT license.