🚀 BERT-base uncased model fine-tuned on SQuAD v1
This is a BERT-base uncased model fine-tuned on SQuAD v1 for extractive question answering and then pruned with the nn_pruning library. Compared to the original dense model it runs faster while keeping (in fact slightly improving) accuracy; see the features and results below.
🚀 Quick Start
To use this model, first install the nn_pruning library. You can then load it with the transformers library as usual and apply a single extra optimization step.
Installation
```bash
pip install nn_pruning
```
Usage Example
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1",
)

# The reference fine-tuned checkpoint is reported as 218.0M parameters.
# The downloaded checkpoint already has its attention heads pruned, but the
# feed-forward layers are still stored as dense matrices containing zeros.
print(f"Parameter count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# optimize_model() removes the zeroed rows and columns, shrinking the model further.
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameter count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
print("Predictions", predictions)
```
✨ Features
- Pruning for Efficiency: The linear layers contain 36.0% of the original weights and the model contains 50.0% of the original weights overall, which yields a 1.84x speedup over the dense model during evaluation (a short sketch for checking the density figures yourself follows this list).
- Accuracy Improvement: With an F1 score of 88.72, it has a 0.22 gain in F1 compared to the dense version.
- Case-Insensitive: As an uncased model, it does not distinguish between upper- and lower-case English (e.g. english and English are treated identically).
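To see where the 36% / 50% figures above come from, the sketch below compares the non-zero weights of this checkpoint against a dense BERT-base question-answering model of the same architecture. It is an illustrative sketch, not the accounting script behind the reported numbers: bert-base-uncased is used here as the dense reference (an assumption of this sketch), and because the reported densities are measured against the original dense model, expect results close to, rather than exactly equal to, the quoted values.

```python
import torch
from transformers import AutoModelForQuestionAnswering

# Illustrative sketch: compare non-zero weights in the pruned checkpoint
# against a dense BERT-base QA model (assumed reference: bert-base-uncased),
# before optimize_model() strips the zero rows/columns.
pruned = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
)
# The qa_outputs head of the reference is randomly initialized; it is only used for counting.
dense = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

def linear_weight_counts(model):
    nonzero = total = 0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            nonzero += int(torch.count_nonzero(module.weight))
            total += module.weight.numel()
    return nonzero, total

pruned_nnz, _ = linear_weight_counts(pruned)
_, dense_total = linear_weight_counts(dense)
print(f"Linear-layer weights kept: {100 * pruned_nnz / dense_total:.1f}%")  # compare with the 36.0% above

overall_nnz = sum(int(torch.count_nonzero(p)) for p in pruned.parameters())
overall_total = sum(p.numel() for p in dense.parameters())
print(f"Overall weights kept: {100 * overall_nnz / overall_total:.1f}%")  # compare with the 50.0% above
```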
📚 Documentation
Fine-Pruning details
This model was fine-tuned from the HuggingFace bert-base-uncased checkpoint on SQuAD1.1, and distilled from the model csarron/bert-base-uncased-squad-v1.
A side-effect of the block pruning is that some of the attention heads are completely removed: 48 heads were removed out of a total of 144 (33.3%).
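The removed heads can be counted directly from the checkpoint. The following is a minimal sketch rather than an official utility: it assumes the standard transformers BERT module layout (model.bert.encoder.layer[i].attention.self.query) and reuses optimize_model from the Quick Start so that pruned heads are physically removed before counting.

```python
from transformers import AutoConfig, AutoModelForQuestionAnswering
from nn_pruning.inference_model_patcher import optimize_model

# Sketch: count how many attention heads survive in each layer after pruning.
# Assumes the standard transformers BERT layout; the text above reports
# 48 heads removed out of 144.
model_id = "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
config = AutoConfig.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
model = optimize_model(model, "dense")  # ensure pruned heads are physically removed

head_dim = config.hidden_size // config.num_attention_heads  # 64 for BERT-base
total_heads = config.num_hidden_layers * config.num_attention_heads  # 12 * 12 = 144

remaining = 0
for i, layer in enumerate(model.bert.encoder.layer):
    heads = layer.attention.self.query.weight.shape[0] // head_dim
    remaining += heads
    print(f"layer {i:2d}: {heads} heads remaining")

removed = total_heads - remaining
print(f"removed {removed} of {total_heads} heads ({100 * removed / total_heads:.1f}%)")
```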
Details of the SQuAD1.1 dataset
| Dataset  | Split | # samples |
| -------- | ----- | --------- |
| SQuAD1.1 | train | 90.6K     |
| SQuAD1.1 | eval  | 11.1K     |
Fine-tuning
- Python: 3.8.5
- Machine specs:
  - Memory: 64 GiB
  - GPUs: 1 GeForce RTX 3090, with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
| Metric | Value | Original (Table 2) | Variation |
| ------ | ----- | ------------------ | --------- |
| EM     | 81.69 | 80.8               | +0.89     |
| F1     | 88.72 | 88.5               | +0.22     |
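A rough way to reproduce these numbers is sketched below. It is not the script used to produce the table: it assumes the datasets and evaluate packages are installed, and because the pipeline's preprocessing defaults differ from a dedicated evaluation script, expect scores close to, but not necessarily identical to, the figures above.

```python
from datasets import load_dataset
from evaluate import load as load_metric
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Illustrative evaluation sketch (assumes `datasets` and `evaluate` are installed).
model_id = "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
qa_pipeline = pipeline("question-answering", model=model_id, tokenizer=model_id)
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")

# Use split="validation[:500]" instead for a quicker smoke test.
squad_eval = load_dataset("squad", split="validation")
squad_metric = load_metric("squad")

predictions, references = [], []
for example in squad_eval:
    output = qa_pipeline(question=example["question"], context=example["context"])
    predictions.append({"id": example["id"], "prediction_text": output["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

# Prints exact-match and F1 scores, comparable to the table above.
print(squad_metric.compute(predictions=predictions, references=references))
```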
🔧 Technical Details
The pruning method used in this model produces structured matrices: entire blocks of weights are zeroed out rather than individual values. To visualize this, hover over the plot below to see the non-zero/zero parts of each matrix. A second, more detailed view shows how the remaining attention heads are distributed across the network after pruning.
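For a quick offline look at this block structure, the following minimal sketch plots the non-zero pattern of a single feed-forward weight matrix. It is only an illustration: matplotlib is an extra dependency, the attribute path assumes the standard transformers BERT layout, and the matrix is inspected before optimize_model strips out the zero rows.

```python
import matplotlib.pyplot as plt
from transformers import AutoModelForQuestionAnswering

# Illustration only: plot the non-zero pattern of one pruned feed-forward
# weight matrix, before optimize_model() removes the zeroed rows.
model_id = "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

# Intermediate (FFN) weight of the first encoder layer; shape [3072, 768] for BERT-base.
weight = model.bert.encoder.layer[0].intermediate.dense.weight.detach()

plt.figure(figsize=(6, 3))
plt.imshow((weight != 0).numpy(), aspect="auto", cmap="gray_r", interpolation="nearest")
plt.title("Non-zero pattern of encoder.layer[0].intermediate.dense.weight")
plt.xlabel("input dimension")
plt.ylabel("output dimension")
plt.tight_layout()
plt.show()
```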
📄 License
This model is released under the MIT license.