🚀 BERT-base uncased model fine-tuned on SQuAD v1
This is a BERT-base uncased model fine-tuned on SQuAD v1 for extractive question answering and then pruned with the nn_pruning library. Compared to the original dense model it runs faster while keeping (in fact slightly improving) accuracy; see the features and results below.
🚀 Quick Start
To use this model, first install the nn_pruning library. You can then load it with the transformers library as usual and apply a single extra optimization step.
Installation
```bash
pip install nn_pruning
```
Usage Example
```python
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

qa_pipeline = pipeline(
    "question-answering",
    model="madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1",
    tokenizer="madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1",
)

# The reference fine-tuned checkpoint is reported as 218.0M parameters.
# The downloaded checkpoint already has its attention heads pruned, but the
# feed-forward layers are still stored as dense matrices containing zeros.
print(f"Parameter count (includes only head pruning, not feed forward pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")

# optimize_model() removes the zeroed rows and columns, shrinking the model further.
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")
print(f"Parameter count after complete optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")

predictions = qa_pipeline({
    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
    'question': "Who is Frederic Chopin?",
})
print("Predictions", predictions)
```
✨ Features
- Pruning for Efficiency: The linear layers contain 36.0% of the original weights and the model contains 50.0% of the original weights overall, which yields a 1.84x speedup over the dense model during evaluation (a short sketch for checking the density figures yourself follows this list).
- Accuracy Improvement: With an F1 score of 88.72, it has a 0.22 gain in F1 compared to the dense version.
- Case-Insensitive: As an uncased model, it does not distinguish between upper- and lower-case English (e.g. english and English are treated identically).
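To see where the 36% / 50% figures above come from, the sketch below compares the non-zero weights of this checkpoint against a dense BERT-base question-answering model of the same architecture. It is an illustrative sketch, not the accounting script behind the reported numbers: bert-base-uncased is used here as the dense reference (an assumption of this sketch), and because the reported densities are measured against the original dense model, expect results close to, rather than exactly equal to, the quoted values.

```python
import torch
from transformers import AutoModelForQuestionAnswering

# Illustrative sketch: compare non-zero weights in the pruned checkpoint
# against a dense BERT-base QA model (assumed reference: bert-base-uncased),
# before optimize_model() strips the zero rows/columns.
pruned = AutoModelForQuestionAnswering.from_pretrained(
    "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
)
# The qa_outputs head of the reference is randomly initialized; it is only used for counting.
dense = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

def linear_weight_counts(model):
    nonzero = total = 0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            nonzero += int(torch.count_nonzero(module.weight))
            total += module.weight.numel()
    return nonzero, total

pruned_nnz, _ = linear_weight_counts(pruned)
_, dense_total = linear_weight_counts(dense)
print(f"Linear-layer weights kept: {100 * pruned_nnz / dense_total:.1f}%")  # compare with the 36.0% above

overall_nnz = sum(int(torch.count_nonzero(p)) for p in pruned.parameters())
overall_total = sum(p.numel() for p in dense.parameters())
print(f"Overall weights kept: {100 * overall_nnz / overall_total:.1f}%")  # compare with the 50.0% above
```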
📚 Documentation
Fine-Pruning details
This model was fine-tuned from the HuggingFace bert-base-uncased checkpoint on SQuAD1.1, and distilled from the model csarron/bert-base-uncased-squad-v1.
A side-effect of the block pruning is that some of the attention heads are completely removed: 48 heads were removed out of a total of 144 (33.3%).
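The removed heads can be counted directly from the checkpoint. The following is a minimal sketch rather than an official utility: it assumes the standard transformers BERT module layout (model.bert.encoder.layer[i].attention.self.query) and reuses optimize_model from the Quick Start so that pruned heads are physically removed before counting.

```python
from transformers import AutoConfig, AutoModelForQuestionAnswering
from nn_pruning.inference_model_patcher import optimize_model

# Sketch: count how many attention heads survive in each layer after pruning.
# Assumes the standard transformers BERT layout; the text above reports
# 48 heads removed out of 144.
model_id = "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
config = AutoConfig.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
model = optimize_model(model, "dense")  # ensure pruned heads are physically removed

head_dim = config.hidden_size // config.num_attention_heads  # 64 for BERT-base
total_heads = config.num_hidden_layers * config.num_attention_heads  # 12 * 12 = 144

remaining = 0
for i, layer in enumerate(model.bert.encoder.layer):
    heads = layer.attention.self.query.weight.shape[0] // head_dim
    remaining += heads
    print(f"layer {i:2d}: {heads} heads remaining")

removed = total_heads - remaining
print(f"removed {removed} of {total_heads} heads ({100 * removed / total_heads:.1f}%)")
```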
Details of the SQuAD1.1 dataset
| Dataset  | Split | # samples |
| -------- | ----- | --------- |
| SQuAD1.1 | train | 90.6K     |
| SQuAD1.1 | eval  | 11.1K     |
Fine-tuning
- Python: 3.8.5
- Machine specs:
  - Memory: 64 GiB
  - GPUs: 1 GeForce RTX 3090, with 24 GiB memory
  - GPU driver: 455.23.05, CUDA: 11.1
Results
| Metric | Value | Original (Table 2) | Variation |
| ------ | ----- | ------------------ | --------- |
| EM     | 81.69 | 80.8               | +0.89     |
| F1     | 88.72 | 88.5               | +0.22     |
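A rough way to reproduce these numbers is sketched below. It is not the script used to produce the table: it assumes the datasets and evaluate packages are installed, and because the pipeline's preprocessing defaults differ from a dedicated evaluation script, expect scores close to, but not necessarily identical to, the figures above.

```python
from datasets import load_dataset
from evaluate import load as load_metric
from transformers import pipeline
from nn_pruning.inference_model_patcher import optimize_model

# Illustrative evaluation sketch (assumes `datasets` and `evaluate` are installed).
model_id = "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
qa_pipeline = pipeline("question-answering", model=model_id, tokenizer=model_id)
qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")

# Use split="validation[:500]" instead for a quicker smoke test.
squad_eval = load_dataset("squad", split="validation")
squad_metric = load_metric("squad")

predictions, references = [], []
for example in squad_eval:
    output = qa_pipeline(question=example["question"], context=example["context"])
    predictions.append({"id": example["id"], "prediction_text": output["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

# Prints exact-match and F1 scores, comparable to the table above.
print(squad_metric.compute(predictions=predictions, references=references))
```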
🔧 Technical Details
The pruning method used in this model produces structured matrices: entire blocks of weights are zeroed out rather than individual values. To visualize this, hover over the plot below to see the non-zero/zero parts of each matrix. A second, more detailed view shows how the remaining attention heads are distributed across the network after pruning.
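For a quick offline look at this block structure, the following minimal sketch plots the non-zero pattern of a single feed-forward weight matrix. It is only an illustration: matplotlib is an extra dependency, the attribute path assumes the standard transformers BERT layout, and the matrix is inspected before optimize_model strips out the zero rows.

```python
import matplotlib.pyplot as plt
from transformers import AutoModelForQuestionAnswering

# Illustration only: plot the non-zero pattern of one pruned feed-forward
# weight matrix, before optimize_model() removes the zeroed rows.
model_id = "madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1"
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

# Intermediate (FFN) weight of the first encoder layer; shape [3072, 768] for BERT-base.
weight = model.bert.encoder.layer[0].intermediate.dense.weight.detach()

plt.figure(figsize=(6, 3))
plt.imshow((weight != 0).numpy(), aspect="auto", cmap="gray_r", interpolation="nearest")
plt.title("Non-zero pattern of encoder.layer[0].intermediate.dense.weight")
plt.xlabel("input dimension")
plt.ylabel("output dimension")
plt.tight_layout()
plt.show()
```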
📄 License
This model is released under the MIT license.