🚀 Pile of Law BERT Large 2 (uncased)
This model is pretrained on English legal and administrative text, using the RoBERTa pretraining objective. It was trained with the same setup as pile-of-law/legalbert-large-1.7M-1, but with a different random seed.
🚀 Quick Start
The model can be used directly for masked language modeling or fine-tuned for downstream tasks. Since it was pretrained on a corpus of English legal and administrative text, it is likely to be a better fit for law-related downstream tasks.
✨ Key Features
- Based on the BERT Large (uncased) architecture and pretrained on the Pile of Law.
- The corpus contains roughly 256GB of English legal and administrative text, providing ample data for language model pretraining.
- Uses a custom WordPiece vocabulary augmented with legal terms, for a total of 32,000 tokens (a quick check is sketched after this list).
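A quick sketch for confirming the vocabulary size described above (a minimal example, assuming the transformers library is installed; see the Installation section below):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
# the tokenizer length should reflect the 32,000-token custom vocabulary
print(len(tokenizer))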
📦 Installation
This README does not provide specific installation commands. The model is loaded through the Hugging Face Transformers library (for example, installing the transformers package with pip, together with PyTorch or TensorFlow); refer to the Hugging Face documentation for details.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.5218929052352905,
'token': 4028,
'token_str': 'exception'},
{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.11434809118509293,
'token': 1151,
'token_str': 'appeal'},
{'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.06454459577798843,
'token': 5345,
'token_str': 'exclusion'},
{'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.043593790382146835,
'token': 3677,
'token_str': 'example'},
{'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.03758585825562477,
'token': 3542,
'token_str': 'objection'}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
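The returned object holds the contextual embeddings. Continuing the PyTorch snippet above, a small illustrative follow-up (the variable names are not from the original card):
# output.last_hidden_state has shape [batch_size, sequence_length, hidden_size];
# the hidden size is 1024 for this BERT-large architecture
token_embeddings = output.last_hidden_state
# the first position corresponds to the [CLS] token and is commonly used
# as a simple sentence-level representation
sentence_embedding = token_embeddings[:, 0, :]
print(sentence_embedding.shape)  # torch.Size([1, 1024])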
And here is how to use it in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
📚 Documentation
Model Description
Pile of Law BERT Large 2 is a transformer model based on the BERT Large (uncased) architecture, pretrained on the Pile of Law, a dataset of roughly 256GB of English legal and administrative text used for language model pretraining.
Intended Uses & Limitations
You can use the raw model for masked language modeling, or fine-tune it for a downstream task. Since this model was pretrained on a corpus of English legal and administrative text, law-related downstream tasks are likely to be a better fit.
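A minimal sketch of loading the model for fine-tuning on a downstream classification task (the binary task and the num_labels value are illustrative assumptions, not something specified by this card):
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
# a randomly initialized classification head is added on top of the pretrained encoder;
# num_labels=2 is just an example for a binary legal classification task
model = BertForSequenceClassification.from_pretrained(
    'pile-of-law/legalbert-large-1.7M-2', num_labels=2)
# from here, train on your own labelled legal data, e.g. via the Trainer API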
Limitations and Bias
Please refer to Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.
This model can have biased predictions. In the following example, where the model is used with a masked language modeling pipeline to describe the race of a perpetrator, it assigns a higher score to "black" than to "white".
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])
[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.',
'score': 0.02685137465596199,
'token': 4311,
'token_str': 'black'},
{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.',
'score': 0.013632853515446186,
'token': 4249,
'token_str': 'white'}]
This bias will also affect all fine-tuned versions of this model.
Training Data
The Pile of Law BERT Large model was pretrained on the Pile of Law, a dataset of roughly 256GB of English legal and administrative text used for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. We describe these data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
Training Procedure
Preprocessing
The model vocabulary consists of 29,000 custom WordPiece tokens and 3,000 legal terms randomly sampled from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The HuggingFace WordPiece tokenizer was fit to the Pile of Law. The 80-10-10 mask/corrupt/leave split described in BERT is used, with a replication rate of 20 so that each context receives different masks. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (legal citations are often erroneously treated as sentences). Inputs are formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences so that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
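A rough sketch of this packing scheme (a simplification based only on the description above; the sentences list, the leading [CLS] and trailing [SEP] handling, and the tokenizer are assumptions, and LexNLP segmentation is not reproduced here):
def pack_example(sentences, tokenizer, max_len=512, first_segment_len=256):
    # Greedily fill sentences until the first segment reaches 256 tokens,
    # insert a [SEP], then keep filling until the whole span would exceed 512 tokens.
    ids = [tokenizer.cls_token_id]
    i = 0
    while i < len(sentences):
        sent_ids = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent_ids) > first_segment_len:
            break  # this sentence does not fit in the first segment
        ids.extend(sent_ids)
        i += 1
    ids.append(tokenizer.sep_token_id)
    while i < len(sentences):
        sent_ids = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent_ids) + 1 > max_len:  # +1 for the final [SEP]
            break  # too large: do not add it, pad the rest instead
        ids.extend(sent_ids)
        i += 1
    ids.append(tokenizer.sep_token_id)
    # the remaining context length is filled with padding tokens
    ids.extend([tokenizer.pad_token_id] * (max_len - len(ids)))
    return ids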
Pretraining
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in the training data. Pretraining used the masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa. The model was pretrained on sequences of length 512 for all steps.
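A minimal sketch of what MLM training with the stated hyperparameters might look like using the Hugging Face Trainer (the toy dataset, output path, and per-device batch size are illustrative assumptions; the actual run used a SambaNova cluster, not this script):
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertForMaskedLM.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

# MLM-only objective (no NSP loss); 15% masking with the standard 80-10-10 split
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# toy stand-in for a corpus of 512-token packed examples
texts = ["The court granted the motion to dismiss.", "The appeal was denied."]
train_dataset = [tokenizer(t, truncation=True, padding='max_length', max_length=512) for t in texts]

args = TrainingArguments(
    output_dir='legalbert-mlm',       # hypothetical output path
    learning_rate=5e-6,               # small learning rate, as described above
    per_device_train_batch_size=16,   # 8 devices x 16 = effective batch size 128
    max_steps=1_700_000,              # 1.7 million steps
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()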
We trained two models with the same setup in parallel model training runs, but with different random seeds. We selected the model with the lowest log-likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for our experiments, and we also release this second model, pile-of-law/legalbert-large-1.7M-2.
Evaluation Results
For fine-tuning results on the CaseHOLD variant provided by the LexGLUE paper, see the model card for pile-of-law/legalbert-large-1.7M-1.
Citation Information
@misc{hendersonkrass2022pileoflaw,
url = {https://arxiv.org/abs/2207.00220},
author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
publisher = {arXiv},
year = {2022}
}



