# 🚀 BERT-base uncased model fine-tuned on SQuAD v1
This model is a BERT-base uncased model fine-tuned on SQuAD v1 and pruned to reduce the number of weights and speed up inference while maintaining high accuracy.
## 🚀 Quick Start
To use this model, first install the nn_pruning library:

```bash
pip install nn_pruning
```
Then, you can use the transformers library as follows:

```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1",
)

# Reference parameter count from the original fine-tuning run, shown for comparison.
print("Reference model parameters: 189.0M")
print(f"Parameters count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# Resize the linear layers so the pruned (all-zero) weights are physically removed.
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameters count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
# The pipeline returns a dict with 'score', 'start', 'end' and 'answer' keys.
print("Predictions", predictions)
```
## ✨ Features
- Pruned Weights: the linear layers contain 26.0% of the original weights, and the model as a whole retains 42.0% of the original weights.
- Faster Inference: thanks to a simple resizing of the linear matrices, the model runs 2.44x as fast as the original model during evaluation (a rough timing sketch is given after this list).
- High Accuracy: F1 is 87.71, a drop of only 0.79 points compared to the original model.
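The 2.44x figure above was measured on the full SQuAD evaluation. As a very rough illustration of the effect of optimize_model, a minimal single-example timing sketch along the following lines could be used (it will not reproduce the reported speedup exactly):

```python
import time

from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Rough, illustrative timing sketch: single-example latency before and after
# optimize_model. The 2.44x figure reported above was measured on the full
# SQuAD evaluation, so this will not reproduce it exactly.
model_name = "madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1"
qa = pipeline("question-answering", model=model_name, tokenizer=model_name)

example = {
    "question": "Who is Frederic Chopin?",
    "context": "Frédéric François Chopin (1 March 1810 – 17 October 1849) was a "
               "Polish composer and virtuoso pianist of the Romantic era.",
}

def mean_latency(p, runs=20):
    p(example)  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        p(example)
    return (time.perf_counter() - start) / runs

before = mean_latency(qa)
qa.model = optimize_model(qa.model, "dense")  # drop the pruned rows/columns
after = mean_latency(qa)
print(f"mean latency before optimization: {before * 1000:.1f} ms")
print(f"mean latency after optimization:  {after * 1000:.1f} ms")
```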
## 🔧 Technical Details

### Model Creation
This model was created using the nn_pruning python library. The pruning method leads to structured matrices, which can be visualized in the interactive plot on the model's Hugging Face page.
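The same structure can also be inspected directly from the checkpoint. The following is a minimal sketch, assuming the standard BertForQuestionAnswering module layout exposed by transformers and that the pruned weights are stored as explicit zero rows until optimize_model is applied:

```python
from transformers import AutoModelForQuestionAnswering

# Minimal sketch: measure the structured sparsity of the unpatched checkpoint.
# Assumes the standard BertForQuestionAnswering layout (model.bert.encoder.layer)
# and that pruned weights are stored as explicit zeros until optimize_model is run.
model = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1"
)

for i, layer in enumerate(model.bert.encoder.layer):
    w = layer.intermediate.dense.weight  # feed-forward "up" projection, shape (3072, 768)
    nonzero_rows = int((w.abs().sum(dim=1) != 0).sum())
    density = float((w != 0).float().mean())
    print(f"layer {i:2d}: {nonzero_rows}/{w.shape[0]} non-empty rows, density {density:.1%}")
```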
### Fine-Pruning Details
This model was fine-tuned from the HuggingFace bert-base-uncased checkpoint on SQuAD1.1 and distilled from the model csarron/bert-base-uncased-squad-v1. It is case-insensitive.

A side-effect of the block pruning is that some of the attention heads are completely removed: 80 heads were removed out of a total of 144 (55.6%). A per-layer view of how the remaining heads are distributed in the network after pruning is available on the model's Hugging Face page.
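One way to see this from code is to count the heads that survive in each layer after patching. The sketch below is illustrative only and assumes that optimize_model removes empty heads by shrinking the query/key/value projections, so the remaining head count can be read off the query weight shape (head size is 64 for BERT-base):

```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Illustrative sketch: count the attention heads that survive in each layer after
# optimization. Assumes empty heads are removed by shrinking the query/key/value
# projections, so the remaining head count can be read off the query weight shape.
model_name = "madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1"
qa = pipeline("question-answering", model=model_name, tokenizer=model_name)
qa.model = optimize_model(qa.model, "dense")

total = 0
for i, layer in enumerate(qa.model.bert.encoder.layer):
    self_attention = layer.attention.self
    heads = self_attention.query.weight.shape[0] // self_attention.attention_head_size
    total += heads
    print(f"layer {i:2d}: {heads} heads remaining")
print(f"{total} heads remaining out of 144")
```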
### Dataset Details

| Property | Details |
|----------|---------|
| Model Type | BERT-base uncased fine-tuned on SQuAD v1 |
| Training Data | SQuAD1.1 |
| Dataset Split | train: 90.6K samples, eval: 11.1K samples |
### Fine-tuning Environment
- Python: 3.8.5
- Machine specs:
  - Memory: 64 GiB
  - GPUs: 1 × GeForce RTX 3090 with 24 GiB of memory
  - GPU driver: 455.23.05, CUDA: 11.1
### Results
| Metric | Value | Original (Table 2 of the BERT paper) | Variation |
|--------|-------|--------------------------------------|-----------|
| EM | 80.03 | 80.8 | -0.77 |
| F1 | 87.71 | 88.5 | -0.79 |
Pytorch model file size: 355 MB (original BERT: 420 MB).
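These numbers can be sanity-checked with the datasets and evaluate libraries (both extra dependencies not mentioned above). The sketch below is not the exact evaluation setup used to produce the table, so the results may differ slightly; remove the select(range(100)) call to run on the full validation split:

```python
from datasets import load_dataset
from evaluate import load
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Minimal sanity-check of EM/F1 on the SQuAD v1.1 validation split.
# Not the exact evaluation setup used for the table above, so numbers may differ.
model_name = "madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1"
qa = pipeline("question-answering", model=model_name, tokenizer=model_name)
qa.model = optimize_model(qa.model, "dense")

squad = load_dataset("squad", split="validation")
squad_metric = load("squad")

predictions, references = [], []
for example in squad.select(range(100)):  # remove .select(...) for the full split
    answer = qa(question=example["question"], context=example["context"])
    predictions.append({"id": example["id"], "prediction_text": answer["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': ..., 'f1': ...}
```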
## 📄 License
This model is released under the MIT license.