🚀 DistilBERT Base Cased Distilled SQuAD
The DistilBERT base cased distilled SQuAD model is a DistilBERT checkpoint fine-tuned for question answering. It maintains high accuracy while having a smaller parameter count and faster inference.
🚀 Quick Start
Use the following code to get started with the model:
>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
>>> context = r"""
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
... a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
... """
>>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
>>> print(
... f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
...)
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
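The pipeline can also answer several questions in one call. The following is a small sketch of batched usage, assuming the question-answering pipeline's support for parallel lists of questions and contexts; this example and its toy context are additions, not part of the original card:

from transformers import pipeline

question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Toy context for illustration only.
context = "Jim Henson was a nice puppet. Jim Henson also created the Muppets."

# Parallel lists of questions and contexts are batched by the pipeline;
# the result is a list with one answer dict per question.
results = question_answerer(
    question=["Who was Jim Henson?", "Who created the Muppets?"],
    context=[context, context],
)
for result in results:
    print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}")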
For examples that call the model directly in PyTorch or TensorFlow, see the Advanced Usage subsection under 💻 Usage Examples below.
✨ Key Features
- DistilBERT model: DistilBERT was proposed in the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT and the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. It is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance on the GLUE language understanding benchmark (a quick way to check the parameter count is sketched after this list).
- Fine-tuned model: This model is a fine-tuned checkpoint of DistilBERT-base-cased, fine-tuned using knowledge distillation (as a second step) on SQuAD v1.1.
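As a rough sanity check of the parameter-count claim above, the following sketch (an addition to the original card) loads the two base checkpoints with AutoModel and compares their sizes; it assumes PyTorch and the transformers library are installed and the weights can be downloaded:

from transformers import AutoModel

# Download (on first use) and load the two base checkpoints being compared.
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def num_params(model):
    # Total number of parameters in the model.
    return sum(p.numel() for p in model.parameters())

print(f"distilbert-base-uncased: {num_params(distilbert) / 1e6:.1f}M parameters")
print(f"bert-base-uncased:       {num_params(bert) / 1e6:.1f}M parameters")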
📦 Installation
The original documentation does not provide installation steps; the examples in this card assume only that the 🤗 Transformers library and either PyTorch or TensorFlow are installed.
💻 Usage Examples
Basic Usage
The basic pipeline example is shown in the 🚀 Quick Start section above.
Advanced Usage
The examples below call the model directly in different deep learning frameworks, for scenarios that need more control than the pipeline:
PyTorch
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Pick the most likely start and end token positions, then decode the answer span.
answer_start_index = int(torch.argmax(outputs.start_logits))
answer_end_index = int(torch.argmax(outputs.end_logits))
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
TensorFlow
from transformers import DistilBertTokenizer, TFDistilBertForQuestionAnswering
import tensorflow as tf
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors="tf")
outputs = model(**inputs)
answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
📚 Detailed Documentation
Uses
This model can be used for question answering.
Misuse and Out-of-scope Use
The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to give factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
Risks, Limitations and Biases
⚠️ Important note
Readers should be aware that language generated by this model can be disturbing or offensive to some and can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups. For example:
>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')
>>> context = r"""
... Alice is sitting on the bench. Bob is sitting next to her.
... """
>>> result = question_answerer(question="Who is the CEO?", context=context)
>>> print(
... f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
...)
Answer: 'Bob', score: 0.7527, start: 32, end: 35
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Training
Training Data
The distilbert-base-cased model was trained using the same data as the distilbert-base-uncased model, which describes its training data as follows:
DistilBERT was pretrained on the same data as BERT, namely BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).
To learn more about the SQuAD v1.1 dataset, see the SQuAD v1.1 data card.
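To inspect SQuAD v1.1 examples directly, the following sketch uses the 🤗 datasets library (an assumption on top of the original card; "squad" is the standard Hub identifier for SQuAD v1.1):

from datasets import load_dataset

# SQuAD v1.1 is published on the Hub as "squad", with "train" and "validation" splits.
squad = load_dataset("squad")

example = squad["validation"][0]
print(example["question"])
print(example["context"][:200])
print(example["answers"])  # {'text': [...], 'answer_start': [...]}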
Training Procedure
Preprocessing
See the distilbert-base-cased model card for further details.
Pretraining
See the distilbert-base-cased model card for further details.
Evaluation
As discussed in the model repository:
This model reaches a F1 score of 87.1 on the [SQuAD v1.1] dev set (for comparison, the BERT bert-base-cased version reaches a F1 score of 88.7).
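For reference, SQuAD-style exact match and F1 scores such as the one quoted above can be computed with the 🤗 evaluate library. The sketch below only illustrates the metric's input format; it is not the evaluation script actually used for this model:

import evaluate

# The "squad" metric returns exact match (EM) and token-level F1, as reported on the SQuAD v1.1 dev set.
squad_metric = evaluate.load("squad")

predictions = [{"id": "1", "prediction_text": "SQuAD dataset"}]
references = [{"id": "1", "answers": {"text": ["the SQuAD dataset"], "answer_start": [147]}}]

print(squad_metric.compute(predictions=predictions, references=references))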
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). The hardware type and hours used below are taken from the associated paper. Note that these details cover only the training of DistilBERT itself, not the fine-tuning on SQuAD.

| Property | Details |
|---|---|
| Hardware type | 8 × 16GB V100 GPUs |
| Hours used | 90 hours |
| Cloud provider | Unknown |
| Compute region | Unknown |
| Carbon emitted | Unknown |
Technical Specifications
See the associated paper for details on the model architecture, objective, compute infrastructure, and training details.
Citation Information
@inproceedings{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  booktitle={NeurIPS EMC^2 Workshop},
  year={2019}
}
APA:
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Model Card Authors
This model card was written by the Hugging Face team.
🔧 Technical Details
The original documentation does not provide additional technical implementation details; see the Technical Specifications section above.
📄 License
This model is released under the Apache 2.0 license.









