🚀 BERT-base uncased model fine-tuned on SQuAD v1
This is a BERT-base uncased model fine-tuned on the SQuAD v1 dataset. It uses pruning techniques to reduce the number of weights and speed up inference while maintaining high accuracy.
✨ Features
- Pruned Weights: The linear layers contain 27.0% of the original weights, and the model contains 43.0% of the original weights overall.
- Faster Inference: With a simple resizing of the linear matrices, it runs 1.96x as fast as bert-base-uncased during evaluation.
- High Accuracy: Its F1 score is 88.33, with only a 0.17 drop compared to bert-base-uncased.
📦 Installation
Install nn_pruning: it contains the optimization script, which simply packs the linear layers into smaller ones by removing empty rows/columns.

```bash
pip install nn_pruning
```
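Conceptually, that packing step keeps only the non-empty rows and columns of each pruned linear layer. The snippet below is a minimal, hypothetical sketch of the idea for a single layer, not the actual nn_pruning implementation (which also keeps the surrounding layers consistent):

```python
import torch.nn as nn


def pack_linear(layer: nn.Linear) -> nn.Linear:
    """Hypothetical sketch: shrink a pruned nn.Linear by dropping all-zero rows/columns."""
    weight = layer.weight.data
    keep_rows = weight.abs().sum(dim=1) != 0  # output features that still carry weights
    keep_cols = weight.abs().sum(dim=0) != 0  # input features that still carry weights

    packed = nn.Linear(int(keep_cols.sum()), int(keep_rows.sum()), bias=layer.bias is not None)
    packed.weight.data = weight[keep_rows][:, keep_cols]
    if layer.bias is not None:
        packed.bias.data = layer.bias.data[keep_rows]
    return packed
```

Removing columns changes the layer's expected input size, so in a real network the neighbouring layers have to be shrunk consistently, which is what optimize_model takes care of.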
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Load the pruned checkpoint as a standard question-answering pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
)

print("bert-base-uncased parameters: 191.0M")
print(f"Parameters count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# Pack the sparse linear layers into smaller dense ones to get the full speedup
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameters count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
print("Predictions", predictions)
```
📚 Documentation
Fine-Pruning details
This model was fine-tuned from the HuggingFace model checkpoint on SQuAD1.1, and distilled from the model bert-large-uncased-whole-word-masking-finetuned-squad. This model is case-insensitive: it does not make a difference between english and English.
A side effect of the block pruning is that some of the attention heads are completely removed: 55 heads were removed out of a total of 144 (38.2%). The distribution of the remaining heads across layers can be inspected directly on the optimized model, as in the sketch below.
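A minimal sketch for counting the remaining heads per layer, assuming the per-head dimension stays at 64 as in standard BERT-base and that the patched attention projections still expose ordinary nn.Linear modules after optimize_model:

```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
)
qa.model = optimize_model(qa.model, "dense")

HEAD_SIZE = 64  # assumption: BERT-base head size, unchanged by pruning
for i, layer in enumerate(qa.model.bert.encoder.layer):
    # After optimization, the query projection's output size is (remaining heads) * HEAD_SIZE
    remaining = layer.attention.self.query.out_features // HEAD_SIZE
    print(f"layer {i:2d}: {remaining} heads remaining")
```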
Details of the SQuAD1.1 dataset
| Property | Details |
| --- | --- |
| Dataset | SQuAD1.1 |
| Train Split Samples | 90.6K |
| Eval Split Samples | 11.1K |
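For reference, a minimal sketch for loading the SQuAD v1.1 splits with the datasets library (raw example counts can differ slightly from the sample counts reported above, which presumably reflect preprocessing):

```python
from datasets import load_dataset

squad = load_dataset("squad")  # SQuAD v1.1
print({split: len(ds) for split, ds in squad.items()})
# roughly 87.6K train / 10.6K validation raw examples
```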
Fine-tuning
- Python Version: 3.8.5
- Machine Specs:
  - CPU: Intel(R) Core(TM) i7-6700K
  - Memory: 64 GiB
  - GPUs: 1 GeForce RTX 3090 with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
| Metric | Value | Original BERT-base (Table 2 of the BERT paper) | Variation |
| --- | --- | --- | --- |
| EM | 81.31 | 80.8 | +0.51 |
| F1 | 88.33 | 88.5 | -0.17 |
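A sketch of how these numbers could be checked with the evaluate library's squad metric, running the pipeline over the validation split (only a small slice is used here for speed; a full pass over the validation set is needed to reproduce the table):

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
)
qa.model = optimize_model(qa.model, "dense")

validation = load_dataset("squad", split="validation[:100]")  # small slice for a quick check
metric = evaluate.load("squad")

predictions, references = [], []
for example in validation:
    answer = qa(question=example["question"], context=example["context"])
    predictions.append({"id": example["id"], "prediction_text": answer["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

print(metric.compute(predictions=predictions, references=references))
# returns 'exact_match' and 'f1'
```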
Model File Size
PyTorch model file size: 374 MB (original BERT: 420 MB).
🔧 Technical Details
This model was created using the nn_pruning python library. It uses ReLUs instead of the GeLUs of the original BERT network to speed up inference. This does not need special handling, as it is supported by the Transformers library and flagged in the model config by the "hidden_act": "relu" entry.
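A quick way to confirm this from the published config, using only the Transformers API:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1"
)
print(config.hidden_act)  # expected: "relu" (vs. "gelu" for vanilla bert-base-uncased)
```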
This model CANNOT be used without the nn_pruning optimize_model function, as it uses NoNorms instead of LayerNorms, which is not currently supported by the Transformers library.
📄 License
This model is released under the MIT license.