🚀 Pile of Law BERT Large 2 (uncased)
This model was pretrained on English legal and administrative text, using the RoBERTa pretraining objective. It uses the same training setup as pile-of-law/legalbert-large-1.7M-1, but with a different random seed.
🚀 Quick Start
The model can be used as-is for masked language modeling or fine-tuned for downstream tasks. Since it was pretrained on a corpus of English legal and administrative text, legal downstream tasks are likely to be a better fit for this model.
✨ Key Features
- Based on the BERT large (uncased) architecture and pretrained on the Pile of Law.
- The Pile of Law contains roughly 256GB of English legal and administrative text, providing ample data for language model pretraining.
- Uses a custom WordPiece vocabulary augmented with legal terms, for a total vocabulary of 32,000 tokens.
📦 Installation
This README does not list specific installation commands; see the Hugging Face documentation for setup instructions.
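As a rough guide (an assumption, not an official requirements list from the model authors), the examples below only need the transformers library plus a deep learning backend:
pip install transformers
# choose a backend depending on the example you want to run
pip install torch        # for the PyTorch snippets
pip install tensorflow   # for the TensorFlow snippet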
💻 Usage Examples
Basic Usage
You can use the model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.5218929052352905,
'token': 4028,
'token_str': 'exception'},
{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.11434809118509293,
'token': 1151,
'token_str': 'appeal'},
{'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.06454459577798843,
'token': 5345,
'token_str': 'exclusion'},
{'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.043593790382146835,
'token': 3677,
'token_str': 'example'},
{'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.03758585825562477,
'token': 3542,
'token_str': 'objection'}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
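The output returned above is a standard Transformers model output. As a small usage sketch (not part of the original card), the token-level features can be read from output.last_hidden_state:
# shape: [batch_size, sequence_length, hidden_size]
features = output.last_hidden_state
# a common, though not prescribed, sentence representation is the mean over tokens
sentence_embedding = features.mean(dim=1)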
And here is how to use it in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
📚 Documentation
Model Description
Pile of Law BERT Large 2 is a transformer model based on the BERT large (uncased) architecture, pretrained on the Pile of Law, a dataset of approximately 256GB of English legal and administrative text for language model pretraining.
Intended Uses & Limitations
You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since the model was pretrained on a corpus of English legal and administrative text, legal downstream tasks are likely to be a better fit.
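As an illustration of such fine-tuning (my own sketch, not the authors' recipe; the binary-classification task, example text, and hyperparameters are placeholders), the checkpoint can be loaded into a standard sequence classification head:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical setup: a binary classification task over legal text.
tokenizer = AutoTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = AutoModelForSequenceClassification.from_pretrained(
    'pile-of-law/legalbert-large-1.7M-2', num_labels=2)

# Tokenize a toy example and run a forward pass; in practice this model and
# tokenizer would be plugged into Trainer or a custom loop with labeled data.
batch = tokenizer(["The contract was terminated for cause."], return_tensors='pt')
logits = model(**batch).logits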
Limitations and Bias
Please refer to Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.
This model can have biased predictions. In the following example, where the model is used with a masked language modeling pipeline to fill in the race descriptor of a criminal, it assigns a higher score to "black" than to "white".
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])
[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.',
'score': 0.02685137465596199,
'token': 4311,
'token_str': 'black'},
{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.',
'score': 0.013632853515446186,
'token': 4249,
'token_str': 'white'}]
This bias will also affect all fine-tuned versions of this model.
Training Data
The Pile of Law BERT large models were pretrained on the Pile of Law, a dataset consisting of approximately 256GB of English legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a CreativeCommons Attribution-NonCommercial-ShareAlike 4.0 International license.
Training Procedure
Preprocessing
The model vocabulary consists of 29,000 tokens from a custom WordPiece vocabulary fit to the Pile of Law with the HuggingFace WordPiece tokenizer, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The 80-10-10 masking/corruption/leave split described in BERT is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (which are often falsely mistaken for sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then continuing to fill sentences so that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
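To make the packing rule concrete, here is a minimal sketch (my own illustration of the description above, not the authors' released preprocessing code; the greedy filling strategy and helper names are assumptions), where the sentence list would come from a segmenter such as LexNLP:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

def pack_span(sentences, tokenizer, half_len=256, max_len=512):
    # Greedily fill sentences into [first ~256 tokens][SEP][rest], then pad to 512.
    ids, i, sep_inserted = [], 0, False
    while i < len(sentences):
        tok = tokenizer.encode(sentences[i], add_special_tokens=False)
        limit = half_len if not sep_inserted else max_len
        if len(ids) + len(tok) > limit:
            if not sep_inserted:
                ids.append(tokenizer.sep_token_id)  # close the first half
                sep_inserted = True
                continue  # retry this sentence in the second half
            break  # next sentence is too large for the remaining context
        ids.extend(tok)
        i += 1
    if not sep_inserted:
        ids.append(tokenizer.sep_token_id)
    ids += [tokenizer.pad_token_id] * (max_len - len(ids))  # pad up to 512 tokens
    return ids

span = pack_span(["First sentence of a filing.", "A second, longer sentence."], tokenizer)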
Pretraining
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in the training data. Pretraining used the masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa. The model was pretrained with sequences of length 512 for all steps.
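As a hedged sketch of an equivalent open-source setup (not the SambaNova training code; the toy dataset, masking rate, and several Trainer arguments are assumptions or placeholders), the key choices above map onto the Transformers API roughly as follows:
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertForMaskedLM.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

# MLM objective only (no next-sentence prediction loss), as in RoBERTa.
# The 15% masking rate is the standard BERT default; the card does not state it explicitly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Toy stand-in for packed 512-token Pile of Law spans.
texts = ["The court granted the motion to dismiss."]
train_dataset = [tokenizer(t, truncation=True, padding='max_length', max_length=512) for t in texts]

args = TrainingArguments(
    output_dir='legalbert-mlm',
    per_device_train_batch_size=128,  # overall batch size of 128 (device layout simplified here)
    learning_rate=5e-6,               # small learning rate to mitigate training instability
    max_steps=1_700_000,              # 1.7M steps
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
# trainer.train()  # requires a real dataset and substantial compute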
We trained two models with the same setup in parallel model training runs, using different random seeds. We selected the model with the lowest log-likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for our experiments, but also release the second model, pile-of-law/legalbert-large-1.7M-2.
Evaluation Results
See the model card of pile-of-law/legalbert-large-1.7M-1 for fine-tuning results on the CaseHOLD variant provided by the LexGLUE paper.
Citation Information
@misc{hendersonkrass2022pileoflaw,
url = {https://arxiv.org/abs/2207.00220},
author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
publisher = {arXiv},
year = {2022}
}



