debertav2-base-uncased open-source language model - English-corpus pretraining for text processing and analysis

Debertav2 Base Uncased

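Developed by mlcorelib

BERT is a pretrained language model based on the Transformer architecture, trained on an English corpus with the masked language modeling and next sentence prediction tasks.

Large language model, English, license: Apache-2.0 #English bidirectional Transformer #masked language modeling #next sentence prediction

Downloads: 21

Released: 3/2/2022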

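Model overview

This is a Transformer model pretrained on an English corpus with the masked language modeling (MLM) objective. It is suitable for a range of natural language processing tasks.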

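Model features

Bidirectional context understanding

Through the masked language modeling task, the model learns bidirectional contextual representations of words.

Multi-task pretraining

Pretraining uses two tasks together: masked language modeling and next sentence prediction.

Uncased

The model is insensitive to the case of the input text, which is uniformly converted to lowercase.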

Model capabilities

Text feature extraction

Sentence relationship prediction

Masked word prediction

Fine-tuning on downstream tasks

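Use cases

Text classification

Sentiment analysis

Classifies text as positive or negative in sentiment

Reaches 93.5% accuracy on the SST-2 dataset

Question answering

Reading comprehension

Answers questions based on a given text

Named entity recognition

Entity extraction

Identifies entities such as person and place names in text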

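🚀 BERT base model (uncased)

The BERT base model (uncased) is pretrained on an English corpus with the masked language modeling (MLM) objective. It learns bidirectional representations of English that can be used to extract features useful for downstream tasks.

🚀 Quick start

You can use this model directly for masked language modeling or fine-tune it for a downstream task. Here is a usage example: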

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]

To get the features of a given text with this model in PyTorch:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
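
In the PyTorch snippet above, `output.last_hidden_state` holds one feature vector per token (768 dimensions for a BERT base model). One common way to turn these into a single fixed-size vector for a text is to average the vectors of the non-padding tokens. This is a choice made here for illustration, not something the original documentation prescribes. The sketch below shows the idea on plain Python lists so that it runs without downloading the model; the function name `masked_mean_pool` and the toy vectors are hypothetical.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), skipping padding (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Average each dimension over the kept token vectors.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy stand-in for one sequence of token vectors (2 dimensions instead of 768).
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]  # the last position is padding and is ignored

print(masked_mean_pool(toy_vectors, toy_mask))  # [2.0, 3.0]
```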

In TensorFlow:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

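✨ Main features

Bidirectional representation learning: through the masked language modeling (MLM) objective, the model learns a bidirectional representation of the sentence, unlike traditional recurrent neural networks (RNNs) and autoregressive models such as GPT.
Multi-task learning: besides MLM, pretraining also uses the next sentence prediction (NSP) objective, which lets the model learn relationships between sentences.
Fine-tunability: the pretrained model can be fine-tuned on downstream tasks such as sequence classification, token classification, or question answering.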

💻 Usage examples

Basic usage

Basic usage is the fill-mask pipeline call shown in the quick start, which returns the same five candidate completions.

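Advanced usage

Fine-tuning the model on a downstream task:

# A fine-tuning code example can be added here
# The original documentation does not provide one; add one to suit your task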
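
As a minimal sketch of what such fine-tuning could look like, the following performs one gradient step of binary sequence classification with `BertForSequenceClassification` in PyTorch. The helper name `fine_tune_step`, the toy sentences and labels, and the learning rate of 2e-5 are assumptions made for illustration; they are not taken from the original documentation. The `transformers` and `torch` imports sit inside the function so that defining it has no side effects; calling it downloads the pretrained weights.

```python
# Hypothetical toy data: 1 = positive sentiment, 0 = negative sentiment.
TOY_TEXTS = ["I loved this film.", "This was a waste of time."]
TOY_LABELS = [1, 0]


def fine_tune_step(texts, labels, model_name="bert-base-uncased", lr=2e-5):
    """Run a single optimization step and return the training loss."""
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    # Adds a randomly initialized classification head on top of the encoder.
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # lr is an assumed value

    # Pad to the longest text in the batch and truncate overly long inputs.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    model.train()
    outputs = model(**batch, labels=torch.tensor(labels))  # loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

Calling `fine_tune_step(TOY_TEXTS, TOY_LABELS)` returns the training loss for that single step; a real run would loop over many batches of a labeled dataset and evaluate on held-out data.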

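📚 Detailed documentation

Intended uses and limitations

You can use the raw model for masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. The model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should consider a model like GPT2.

Limitations and bias

Even though the training data used for this model could be characterized as fairly neutral, the model can produce biased predictions: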

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the man worked as a waiter. [SEP]',
  'score': 0.0523831807076931,
  'token': 15610,
  'token_str': 'waiter'},
 {'sequence': '[CLS] the man worked as a barber. [SEP]',
  'score': 0.04962705448269844,
  'token': 13362,
  'token_str': 'barber'},
 {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
  'score': 0.03788609802722931,
  'token': 15893,
  'token_str': 'mechanic'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
  'token_str': 'salesman'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
  'score': 0.1597415804862976,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the woman worked as a maid. [SEP]',
  'score': 0.1154729500412941,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
  'score': 0.037968918681144714,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]

This bias will also affect all fine-tuned versions of this model.
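
One rough way to make the difference between the two outputs concrete is to compare the two top-5 lists directly. The sketch below copies the predicted tokens from the outputs above and computes their Jaccard overlap; the metric and the function name `jaccard` are illustrative assumptions, not part of the original documentation.

```python
# Top-5 predicted tokens, copied from the two fill-mask outputs above.
man_top5 = ["carpenter", "waiter", "barber", "mechanic", "salesman"]
woman_top5 = ["nurse", "waitress", "maid", "prostitute", "cook"]


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two token lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


shared = set(man_top5) & set(woman_top5)
print(shared)                         # set()
print(jaccard(man_top5, woman_top5))  # 0.0
```

An overlap of zero means the two prompts, which differ only in the subject's gender, share no occupation among their five most likely completions.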

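Training data

The BERT model was pretrained on BookCorpus and English Wikipedia (excluding lists, tables, and headers). BookCorpus is a dataset of 11,038 unpublished books.

Training procedure

Preprocessing

The text is lowercased and tokenized with WordPiece, with a vocabulary size of 30,000. The model inputs have the following form: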

[CLS] Sentence A [SEP] Sentence B [SEP]

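With probability 0.5, sentence A and sentence B are two consecutive sentences from the original corpus; otherwise, sentence B is another random sentence from the corpus. A "sentence" here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" have a combined length of less than 512 tokens.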
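
The preprocessing described above can be sketched in a few lines. The example lowercases the text, splits each word with a greedy longest-match-first WordPiece lookup, and assembles a `[CLS] A [SEP] B [SEP]` input in which B is the true next segment about half of the time. Several details are assumptions made for illustration: the tiny vocabulary, the whitespace-only word splitting (the real tokenizer also handles punctuation), drawing the random segment from a different document, counting the three special tokens toward the 512 limit, and the function names.

```python
import random

# Hypothetical toy vocabulary; the real one has about 30,000 WordPiece entries.
TOY_VOCAB = {"[UNK]", "the", "cat", "sat", "dogs", "bark", "play", "##ing", "##ed"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace (a simplification), then apply WordPiece."""
    return [p for word in text.lower().split() for p in wordpiece(word, vocab)]


def make_pair(documents, rng, max_len=512):
    """Build one input; documents is a list of documents, each a list of
    at least two tokenized segments. Needs at least two documents."""
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)
    seg_a = list(doc[i])
    if rng.random() < 0.5:
        seg_b, is_next = list(doc[i + 1]), True   # true continuation
    else:
        other = rng.choice([d for j, d in enumerate(documents) if j != doc_idx])
        seg_b, is_next = list(rng.choice(other)), False  # random segment
    # Trim the longer segment until everything fits (assumed truncation rule).
    while len(seg_a) + len(seg_b) + 3 > max_len:
        (seg_a if len(seg_a) >= len(seg_b) else seg_b).pop()
    return ["[CLS]"] + seg_a + ["[SEP]"] + seg_b + ["[SEP]"], is_next


print(tokenize("The cat PLAYED", TOY_VOCAB))  # ['the', 'cat', 'play', '##ed']
```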

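The details of the masking procedure for each sentence are as follows:

15% of the tokens are masked.
In 80% of the cases, the masked tokens are replaced by [MASK].
In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.
In the remaining 10% of the cases, the masked tokens are left unchanged.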
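
The masking rules above can be sketched as follows. This is a simplified illustration rather than the original training code: each token is selected independently with probability 0.15 (so the selected share is 15% only on average), special tokens are never selected, and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels[i] is the original token where a
    prediction is required and None elsewhere. vocab must contain at least
    one token different from every input token."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue  # not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"                                   # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.choice([v for v in vocab if v != tok])  # 10%: random other token
        # remaining 10%: the token is left unchanged
    return inputs, labels


rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "[SEP]"]
inputs, labels = mask_tokens(tokens, ["the", "cat", "sat", "dog"], rng)
print(inputs, labels)
```

Keeping 10% of the selected tokens unchanged and replacing another 10% with random tokens means the model cannot rely on `[MASK]` being present at every position it has to predict.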

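Pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips in total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 tokens for the remaining 10%. The optimizer was Adam with a learning rate of 1e-4, \(\beta_{1} = 0.9\), \(\beta_{2} = 0.999\), a weight decay of 0.01, learning rate warmup over 10,000 steps, and linear decay of the learning rate afterwards.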
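
The learning rate schedule described above can be sketched as a function of the training step. The sketch assumes that the warmup rises linearly from 0 and that the linear decay reaches 0 at the final step (step 1,000,000); the text states only the peak rate, the warmup length, and that the decay is linear. The function name `learning_rate` is hypothetical.

```python
PEAK_LR = 1e-4           # learning rate stated in the text
WARMUP_STEPS = 10_000    # warmup length stated in the text
TOTAL_STEPS = 1_000_000  # one million training steps


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to 0 (assumed endpoints)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate is half the peak both at step 5,000 (midway through the warmup) and at step 505,000 (midway through the decay).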

Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

Task	MNLI-(m/mm)	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	Average
	84.6/83.4	71.2	90.5	93.5	52.1	85.8	88.9	66.4	79.6
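
The Average column appears to be the plain mean of the nine scores in the row, with the MNLI matched and mismatched results counted separately. The short check below reproduces 79.6 under that assumption; the dictionary keys `MNLI-m` and `MNLI-mm` are labels chosen here for the two MNLI figures.

```python
# Scores copied from the GLUE table above.
scores = {
    "MNLI-m": 84.6, "MNLI-mm": 83.4, "QQP": 71.2, "QNLI": 90.5, "SST-2": 93.5,
    "CoLA": 52.1, "STS-B": 85.8, "MRPC": 88.9, "RTE": 66.4,
}

average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 79.6
```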

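🔧 Technical details

BERT is a Transformer-based model pretrained on a large corpus of English data in a self-supervised fashion. It uses two objectives, masked language modeling (MLM) and next sentence prediction (NSP), through which the model learns an inner representation of the English language. This representation can be used to extract features useful for downstream tasks, for example to train a standard classifier on a dataset of labeled sentences.

📄 License

This model is released under the Apache-2.0 license.

BibTeX entry and citation info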

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}