🚀 INT8 DistilBERT Base Uncased Fine-Tuned on SQuAD
This model is an INT8 quantized version of DistilBERT base uncased, fine-tuned on the Stanford Question Answering Dataset (SQuAD). Quantization was performed with Hugging Face's Optimum Intel and Intel® Neural Compressor, aiming to reduce model size and speed up inference while maintaining accuracy.
✨ Features
- Quantization: Converted from FP32 to INT8 using post-training static quantization.
- Optimized for QA: Designed for question-answering tasks with fast inference and reduced model size.
- Multi-framework Support: Available in both PyTorch and ONNX versions.
📦 Installation
The original model card does not provide specific installation steps.
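The usage examples below assume Optimum is installed with the relevant extras, e.g. `pip install "optimum[neural-compressor]"` for the PyTorch / Neural Compressor path and `pip install "optimum[onnxruntime]"` for the ONNX path; these package extras are taken from the Optimum documentation rather than from the original card.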
💻 Usage Examples
Basic Usage
```python
from optimum.intel import INCModelForQuestionAnswering

model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
int8_model = INCModelForQuestionAnswering.from_pretrained(model_id)
```
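A minimal end-to-end sketch of running the quantized PyTorch model follows. It assumes the tokenizer files are available in the same repository (otherwise load the base model's tokenizer) and that the model returns the standard question-answering start/end logits; the question and context strings are illustrative.

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import INCModelForQuestionAnswering

model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
int8_model = INCModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

question = "What was the model fine-tuned on?"
context = "The model was fine-tuned on the Stanford Question Answering Dataset (SQuAD)."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = int8_model(**inputs)

# Take the highest-scoring start/end positions and decode the answer span.
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits)) + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end], skip_special_tokens=True)
print(answer)
```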
Advanced Usage
```python
from optimum.onnxruntime import ORTModelForQuestionAnswering

model = ORTModelForQuestionAnswering.from_pretrained("Intel/distilbert-base-uncased-distilled-squad-int8-static")
```
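The ONNX model can also be dropped into the transformers `pipeline` API. A short sketch, again assuming the tokenizer can be loaded from the same repository:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
model = ORTModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a standard question-answering pipeline backed by the ONNX Runtime model.
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa(
    question="Which dataset was the model fine-tuned on?",
    context="The model was fine-tuned on the Stanford Question Answering Dataset (SQuAD).",
)
print(result["answer"], result["score"])
```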
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Authors | Xin He, Zixuan Cheng, Yu Wenz |
| Date | Aug 4, 2022 |
| Version | The base model for this quantization process was distilbert-base-uncased-distilled-squad, a distilled version of BERT designed for the question-answering task. |
| Model Type | Language Model |
| Paper or Other Resources | Base Model: [distilbert-base-uncased-distilled-squad](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad) |
| License | apache-2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/distilbert-base-uncased-distilled-squad-int8-static-inc/discussions) and Intel DevHub Discord |
| Quantization Details | The model underwent post-training static quantization to convert it from its original FP32 precision to INT8, optimizing for size and inference speed while aiming to retain as much of the original model's accuracy as possible. |
| Calibration Details | For PyTorch, calibration used the train dataloader with an effective sampling size of 304, because the default calibration sampling size of 300 is not exactly divisible by the batch size of 8. For the ONNX version, calibration used the eval dataloader with the default calibration sampling size of 100. |
Intended Use
| Property | Details |
|----------|---------|
| Primary Intended Uses | This model is intended for question-answering tasks, where it provides answers to questions given a context passage. It is optimized for scenarios requiring fast inference and reduced model size without significantly compromising accuracy. |
| Primary Intended Users | Researchers, developers, and enterprises that require efficient, low-latency question-answering capabilities in their applications, particularly where computational resources are limited. |
| Out-of-scope Uses | |
Evaluation
PyTorch Version
This is an INT8 PyTorch model quantized with [huggingface/optimum-intel](https://github.com/huggingface/optimum-intel) through the use of [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
| | INT8 | FP32 |
|---|---|---|
| Accuracy (eval-f1) | 86.1069 | 86.8374 |
| Model size (MB) | 74.7 | 265 |
ONNX Version
This is an INT8 ONNX model quantized with [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
| | INT8 | FP32 |
|---|---|---|
| Accuracy (eval-f1) | 0.8633 | 0.8687 |
| Model size (MB) | 154 | 254 |
🔧 Technical Details
The model was quantized using post-training static quantization. For PyTorch, specific calibration dataloader settings were needed because the default calibration sampling size is not evenly divisible by the batch size; the ONNX version used a different calibration dataloader with the default sampling size.
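The card does not include the exact quantization script. The following is a minimal sketch of how such a post-training static quantization run can be set up with the optimum-intel `INCQuantizer` API; the preprocessing function, maximum sequence length, and output directory are illustrative assumptions, and the current API may differ from the version used when this model was produced in 2022.

```python
from functools import partial

from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig

base_model_id = "distilbert/distilbert-base-uncased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def preprocess(examples, tokenizer):
    # Tokenize question/context pairs for calibration; max_length is illustrative.
    return tokenizer(
        examples["question"],
        examples["context"],
        padding="max_length",
        max_length=384,
        truncation=True,
    )

quantizer = INCQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "squad",
    preprocess_function=partial(preprocess, tokenizer=tokenizer),
    num_samples=300,  # default calibration sampling size; the card reports 304 effective samples
    dataset_split="train",
)
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=calibration_dataset,
    save_directory="distilbert-squad-int8-static",
)
```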
📄 License
This model is licensed under the Apache 2.0 license.
⚠️ Important Note
Users should be aware of potential biases present in the training data (SQuAD and Wikipedia), and consider the implications of these biases on the model's outputs. Additionally, quantization may introduce or exacerbate biases in certain scenarios.
💡 Usage Tip
- Users should consider the balance between performance and accuracy when deploying quantized models in critical applications.
- Further fine-tuning or calibration may be necessary for specific use cases or to meet stricter accuracy requirements.