🚀 BERT-base uncased model fine-tuned on SQuAD v1
This is a BERT-base uncased model fine-tuned on the SQuAD v1 dataset. It leverages block pruning to reduce the model size and speed up inference while maintaining high accuracy.
🚀 Quick Start
To start using this model, first install the `nn_pruning` library: it contains the optimization script that packs the linear layers into smaller ones by removing empty rows and columns.
```bash
pip install nn_pruning
```
Then you can use the `transformers` library almost as usual; you just have to call `optimize_model` once the pipeline has loaded.
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1"
)

# For reference: bert-base-uncased has roughly 110M parameters.
print("bert-base-uncased parameters: ~110M")
print(f"Parameters count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# Pack the sparse linear layers into smaller dense ones to get the full speedup.
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameters count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
print("Predictions", predictions)
```
✨ Features
- Pruning: The linear layers of the model contain 30.0% of the original weights, and the model as a whole retains 45.0% of the original weights. This significantly reduces the model size and speeds up inference (see the density-check sketch after this list).
- Activation Function: ReLU activations replace the original GELUs to speed up inference; this is supported natively by the Transformers library.
- Accuracy: The model achieves an F1 score of 89.19, a gain of 0.69 over the original bert-base-uncased model.
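As a sanity check on the 30% figure, here is a minimal sketch (not part of the original card) that counts non-zero weights in the encoder's 2-D weight matrices. Which matrices to count is an assumption, and since the checkpoint already has attention heads physically removed, the measured ratio will only approximate the quoted figure.

```python
# Minimal density check (illustrative sketch; see the assumptions above).
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1"
)

total, nonzero = 0, 0
for name, param in model.named_parameters():
    # Count only the encoder's linear-layer weight matrices (2-D tensors).
    if "encoder" in name and param.dim() == 2:
        total += param.numel()
        nonzero += int((param != 0).sum())
print(f"Non-zero fraction of encoder linear weights: {nonzero / total:.1%}")
```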
📦 Installation
Install the `nn_pruning` library using the following command:

```bash
pip install nn_pruning
```
💻 Usage Examples
See the Quick Start above for the complete basic-usage example: load the question-answering pipeline, call `optimize_model` to pack the pruned linear layers, and run a prediction.
📚 Documentation
Fine-Pruning details
This model was fine-tuned from the HuggingFace bert-base-uncased checkpoint on SQuAD1.1, and distilled from bert-large-uncased-whole-word-masking-finetuned-squad. It is case-insensitive.
A side effect of block pruning is that some attention heads are removed entirely: 55 out of a total of 144 (38.2%). (The original model card includes an interactive plot showing how the remaining heads are distributed across the network.)
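To see where the remaining heads sit, one option (a hedged sketch, assuming the checkpoint records removed heads through the standard Transformers `prune_heads` mechanism) is to inspect each layer's self-attention module:

```python
# Hedged sketch: count the attention heads remaining in each encoder layer.
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1"
)
for i, layer in enumerate(model.bert.encoder.layer):
    print(f"layer {i:2d}: {layer.attention.self.num_attention_heads} heads")
```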
Details of the SQuAD1.1 dataset
| Dataset  | Split | # samples |
| -------- | ----- | --------- |
| SQuAD1.1 | train | 90.6K     |
| SQuAD1.1 | eval  | 11.1K     |
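For comparison, here is a hedged sketch that prints the raw split sizes with the 🤗 `datasets` library. The raw dataset has 87,599 train and 10,570 validation examples, so the counts above presumably refer to preprocessed training features rather than raw examples.

```python
# Hedged sketch: print the raw SQuAD1.1 split sizes with the datasets library.
from datasets import load_dataset

squad = load_dataset("squad")
print({split: len(ds) for split, ds in squad.items()})
# -> {'train': 87599, 'validation': 10570}
```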
Fine-tuning
- Python: `3.8.5`
- Machine specs:
  - Memory: 64 GiB
  - GPUs: 1 GeForce RTX 3090, with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
PyTorch model file size: `374MB` (original BERT: `420MB`)

| Metric | Value | Original (Table 2) | Variation |
| ------ | ----- | ------------------ | --------- |
| EM     | 82.21 | 80.8               | +1.41     |
| F1     | 89.19 | 88.5               | +0.69     |
🔧 Technical Details
This model was created using the `nn_pruning` Python library. It uses NoNorms instead of LayerNorms (a NoNorm applies only an element-wise scale and shift, skipping the mean/variance normalization), which is why the `optimize_model` function from the `nn_pruning` library must be applied after loading. The pruning method leads to structured matrices, which lets the packed model run 2.01x as fast as bert-base-uncased on the evaluation.
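The 2.01x figure comes from the full SQuAD evaluation; a quick single-example timing like the hedged sketch below will not reproduce it exactly, but it illustrates the effect of `optimize_model`:

```python
# Rough timing sketch (illustrative, not the card's benchmark): compare
# single-example latency before and after optimize_model packs the layers.
import time
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1",
)
example = {
    "question": "Who is Frederic Chopin?",
    "context": "Frédéric François Chopin (1810-1849) was a Polish composer "
               "and virtuoso pianist of the Romantic era.",
}

def mean_latency(p, n=20):
    p(example)  # warm-up run
    start = time.perf_counter()
    for _ in range(n):
        p(example)
    return (time.perf_counter() - start) / n

before = mean_latency(qa)
qa.model = optimize_model(qa.model, "dense")
after = mean_latency(qa)
print(f"before: {before * 1e3:.1f} ms  after: {after * 1e3:.1f} ms  "
      f"speedup: {before / after:.2f}x")
```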
📄 License
This model is released under the MIT license.