🚀 BERT-base uncased model fine-tuned on SQuAD v1
This model is a BERT-base uncased model fine-tuned on the SQuAD v1 dataset and then pruned, trading a small accuracy drop for a significant gain in inference speed.
🚀 Quick Start
To use this model, you first need to install the `nn_pruning` library, which contains the optimization script that packs the linear layers into smaller ones by removing empty rows and columns.

```bash
pip install nn_pruning
```
Then you can use the `transformers` library almost as usual: you just have to call `optimize_model` once the pipeline has loaded.
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x2.32-f86.6-d15-hybrid-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x2.32-f86.6-d15-hybrid-v1"
)

print("bert-base-uncased parameters: 165.0M")
print(f"Parameters count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")

print(f"Parameters count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})

print("Predictions", predictions)
```
✨ Features
- Pruning Optimization: The linear layers of this model retain only 15.0% of the original weights, and the model as a whole retains 34.0% of the original weights. The block pruning yields structured matrices, which lets the model run 2.32x as fast as `bert-base-uncased` during evaluation (see the sketch after this list).
- Accuracy Balance: With an F1 of 86.64, the model sits 1.86 points below `bert-base-uncased`, so it keeps a relatively high accuracy despite the pruning.
- Case-Insensitive: The model is uncased, so it does not distinguish between, for example, "english" and "English".
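To get a rough feel for the pruning described in the first bullet, you can compare the total size of the linear layers before and after `optimize_model`. This is a minimal sketch, not part of the original card; it assumes the optimized layers are still ordinary `torch.nn.Linear` modules:

```python
import torch
from transformers import AutoModelForQuestionAnswering
from nn_pruning.inference_model_patcher import optimize_model

name = "madlag/bert-base-uncased-squadv1-x2.32-f86.6-d15-hybrid-v1"
model = AutoModelForQuestionAnswering.from_pretrained(name)

def linear_parameters(m):
    # Total number of parameters held by nn.Linear modules.
    return sum(p.numel()
               for mod in m.modules() if isinstance(mod, torch.nn.Linear)
               for p in mod.parameters())

before = linear_parameters(model)
model = optimize_model(model, "dense")  # drops empty rows/columns from the linear layers
after = linear_parameters(model)
print(f"Linear-layer parameters: {before / 1e6:.1f}M -> {after / 1e6:.1f}M "
      f"({100 * after / before:.1f}% kept)")
```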
📦 Installation
Install the `nn_pruning` library with the following command:

```bash
pip install nn_pruning
```
💻 Usage Examples
Basic Usage
The basic usage is the Quick Start snippet above: load the `question-answering` pipeline, call `optimize_model` on `qa_pipeline.model`, then run predictions.
📚 Documentation
Model Creation
This model was created with the `nn_pruning` Python library. It was fine-tuned from the HuggingFace `bert-base-uncased` checkpoint on SQuAD v1.1 and distilled from `bert-large-uncased-whole-word-masking-finetuned-squad`.
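For context on the distillation step, the sketch below shows a generic soft-target distillation loss of the kind commonly used when a smaller student is trained against a larger teacher's logits. It illustrates the technique only and is not the exact recipe used to produce this checkpoint:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```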
Dataset Details
| Property | Details |
|----------|---------|
| Model Type | BERT-base uncased fine-tuned on SQuAD v1 |
| Training Data | SQuAD v1.1 (train: 90.6K samples, eval: 11.1K samples) |
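If you want to inspect the data yourself, the SQuAD v1.1 splits are available through the `datasets` library (a minimal sketch; the exact split sizes it reports may differ slightly from the rounded figures above):

```python
from datasets import load_dataset

squad = load_dataset("squad")         # SQuAD v1.1
print(squad)                          # train / validation splits and their sizes
print(squad["train"][0]["question"])  # peek at one training example
```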
Fine-tuning Details
- Python Version: 3.8.5
- Machine Specs:
  - Memory: 64 GiB
  - GPUs: 1 x GeForce RTX 3090 with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
| Metric | Value | Original (Table 2) | Variation |
|--------|-------|--------------------|-----------|
| EM | 78.77 | 80.8 | -2.03 |
| F1 | 86.64 | 88.5 | -1.86 |
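The EM/F1 figures follow the standard SQuAD v1.1 metric, which can be computed with the `evaluate` library. A minimal sketch scoring a single prediction (a full evaluation simply loops this over the whole validation set; the id and texts here are illustrative):

```python
import evaluate

squad_metric = evaluate.load("squad")
# Predictions are matched to references by id; EM and F1 are averaged over the set.
predictions = [{"id": "example-0", "prediction_text": "Denver Broncos"}]
references = [{"id": "example-0",
               "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}]
print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```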
Fine-Pruning Details
A side effect of the block pruning is that some attention heads are completely removed: 63 heads out of 144 (43.8%) were pruned. How the remaining heads are distributed across the network's layers can be inspected with the sketch below.
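You can estimate how many heads survive in each layer from the size of the query projection after the model is optimized. This is a sketch under two assumptions (not from the original card): the standard BERT-base head size of 64, and that `optimize_model` shrinks each attention projection down to its remaining heads:

```python
from transformers import AutoModelForQuestionAnswering
from nn_pruning.inference_model_patcher import optimize_model

name = "madlag/bert-base-uncased-squadv1-x2.32-f86.6-d15-hybrid-v1"
model = optimize_model(AutoModelForQuestionAnswering.from_pretrained(name), "dense")

head_size = 64  # assumed BERT-base head dimension (hidden size 768 / 12 heads)
for i, layer in enumerate(model.bert.encoder.layer):
    # Assumption: after optimization the query projection keeps only the rows
    # belonging to the surviving heads of this layer.
    remaining = layer.attention.self.query.weight.shape[0] // head_size
    print(f"layer {i:2d}: {remaining} heads remaining")
```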
📄 License
This model is released under the MIT license.