legalbert-large-1.7M-1: an open-source legal language model for English legal and administrative text

Legalbert Large 1.7M 1

Developed by pile-of-law

A BERT large model pretrained on English legal and administrative text using the RoBERTa pretraining objective.

Large language model

Transformers

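English #Legal text pretraining #Masked language modeling #Legal document analysis

Downloads: 120

Release date: 4/29/2022

Model overview

The model uses the BERT architecture and was pretrained specifically on the Pile of Law dataset, which makes it suited to law-related natural language processing tasks.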

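Model features

Legal-domain specialization

Pretrained specifically on legal and administrative text, so legal terminology is well represented in what the model has seen.

Large-scale training data

Pretrained on roughly 256GB of English legal and administrative text.

Domain-adapted tokenizer

A 32,000-token vocabulary that includes 3,000 dedicated legal terms.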

Model capabilities

Legal text understanding

Masked language modeling

Legal text classification

Legal question answering

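Use cases

Legal document processing

Legal term prediction

Predicting specialized terms in legal text.

In the fill-mask example on this page, the model ranks 'appeal' as the most likely fill.

Legal document analysis

Analyzing the content of legal documents.

Legal research assistance

Case retrieval enhancement

Improving legal case retrieval systems.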

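🚀 Pile of Law BERT large model (uncased)

The model is pretrained on English legal and administrative text with the RoBERTa pretraining objective and is intended for language processing in the legal domain.

🚀 Quick start

The model can be used directly for masked language modeling or fine-tuned for a downstream task. Because it was pretrained on a corpus of English legal and administrative text, it is likely to be a better fit for legal downstream tasks.

✨ Key features

Based on the BERT large model (uncased) architecture and pretrained on the Pile of Law.
The corpus contains roughly 256GB of English legal and administrative text from 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, and statutes.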

💻 Usage examples

Basic usage

Masked language modeling with a pipeline:

>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.6343119740486145, 
  'token': 1151, 
  'token_str': 'appeal'}, 
  {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.10488124936819077, 
  'token': 3542, 
  'token_str': 'objection'}, 
  {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.0708756372332573, 
  'token': 1999, 
  'token_str': 'application'}, 
  {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.02558572217822075, 
  'token': 3677, 
  'token_str': 'example'}, 
  {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.013266939669847488, 
  'token': 1347, 
  'token_str': 'action'}]
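
The pipeline returns a list of dictionaries, one per candidate token. As a minimal sketch of post-processing that output, the helper below keeps only candidates above a score threshold. The function name `summarize_fill_mask` and the 0.05 threshold are illustrative choices, not part of the model card; the sample data is an abbreviated copy of the output above.

```python
def summarize_fill_mask(predictions, min_score=0.05):
    """Return (token, score) pairs at or above min_score, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        (p["token_str"], round(p["score"], 4))
        for p in ranked
        if p["score"] >= min_score
    ]


# Abbreviated copy of the pipeline output shown above ('sequence' omitted).
predictions = [
    {"score": 0.6343119740486145, "token": 1151, "token_str": "appeal"},
    {"score": 0.10488124936819077, "token": 3542, "token_str": "objection"},
    {"score": 0.0708756372332573, "token": 1999, "token_str": "application"},
    {"score": 0.02558572217822075, "token": 3677, "token_str": "example"},
    {"score": 0.013266939669847488, "token": 1347, "token_str": "action"},
]

# Only the three candidates scoring at least 0.05 are kept.
print(summarize_fill_mask(predictions))
```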

Advanced usage

Extracting the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

And in TensorFlow:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

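📚 Detailed documentation

Intended uses and limitations

You can use the raw model for masked language modeling or fine-tune it for a downstream task.
Because the model was pretrained on a corpus of English legal and administrative text, legal downstream tasks are likely to be more in-domain for it.

Limitations and bias

See Appendix G of the Pile of Law paper for the copyright limitations related to dataset and model use.
The model can make biased predictions. For example, in a fill-mask pipeline query about the race descriptor of a criminal, the model assigns a higher score to "black" than to "white". This bias will also affect all fine-tuned versions of the model.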
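
A probe of this kind can be sketched with the `targets` argument of the fill-mask pipeline, which restricts scoring to a given list of candidate tokens. The helper names below are hypothetical, and any prompt passed to it is the caller's own choice; this is an illustration of the probing pattern, not the authors' evaluation code.

```python
def compare_target_scores(fill_mask, text, targets):
    """Score only the given candidate tokens for the [MASK] slot in `text`.

    `fill_mask` is any callable with the fill-mask pipeline interface,
    i.e. fill_mask(text, targets=[...]) -> list of dicts with
    'token_str' and 'score' keys.
    """
    results = fill_mask(text, targets=targets)
    return {r["token_str"]: r["score"] for r in results}


def load_pipeline():
    """Build the real pipeline (downloads the model on first use)."""
    # Imported lazily so compare_target_scores can be exercised with a
    # stub callable when transformers is not installed.
    from transformers import pipeline

    return pipeline(task="fill-mask", model="pile-of-law/legalbert-large-1.7M-1")
```

Calling `compare_target_scores(load_pipeline(), prompt, ["black", "white"])` on a prompt containing one `[MASK]` returns a score per target, which makes gaps like the one described above directly visible.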

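Training data

The model was pretrained on the Pile of Law, a dataset of roughly 256GB of English legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. The data sources are described in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.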

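Training procedure

Preprocessing

The model vocabulary consists of 29,000 custom word pieces, fit to the Pile of Law with the HuggingFace WordPiece tokenizer, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens.
Masking follows the 80-10-10 masking, corruption, leave split described in BERT, with a replication rate of 20 to create different masks for each context.
Sequences are generated with the LexNLP sentence segmenter, which handles sentence segmentation around legal citations (citations are often mistaken for sentence boundaries).
The input is formatted as follows: sentences are filled in until they comprise 256 tokens, followed by a [SEP] token; further sentences are then filled in so that the entire span does not exceed 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.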
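
The input formatting described above can be sketched as a greedy packing routine. This is an illustration under stated assumptions, not the authors' preprocessing code: it assumes sentences are already tokenized, that the first segment takes whole sentences while it stays within 256 tokens, and that no special tokens other than `[SEP]` and padding are involved. The name `pack_sentences` is hypothetical.

```python
SEP, PAD = "[SEP]", "[PAD]"


def pack_sentences(sentences, first_len=256, max_len=512):
    """Pack tokenized sentences into one fixed-length context.

    sentences: list of token lists, in document order.
    Returns (context, n_consumed), where n_consumed is the number of
    sentences that were placed into this context.
    """
    context, i = [], 0
    # First segment: add whole sentences while it stays within first_len.
    while i < len(sentences) and len(context) + len(sentences[i]) <= first_len:
        context.extend(sentences[i])
        i += 1
    context.append(SEP)
    # Second segment: keep adding whole sentences while the span fits.
    while i < len(sentences) and len(context) + len(sentences[i]) <= max_len:
        context.extend(sentences[i])
        i += 1
    # A sentence that would overflow is not added; pad the remainder.
    context.extend([PAD] * (max_len - len(context)))
    return context, i


# Toy run with small limits so the behaviour is easy to inspect.
toy = [["a"] * 3, ["b"] * 2, ["c"] * 4, ["d"] * 5]
context, used = pack_sentences(toy, first_len=5, max_len=12)
print(context, used)
```

In the toy run the fourth sentence does not fit, so it is left for the next context and the last two positions are padding. A sentence longer than `max_len` would never be consumed by this sketch, so a real pipeline would need to truncate or split it first.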

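Pretraining

The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps.
A small learning rate of 5e-6 and a batch size of 128 were used to mitigate training instability, which may be due to the diversity of sources in the training data.
Pretraining used the masked language modeling (MLM) objective without NSP loss, as described in RoBERTa.
The model was pretrained with sequences of length 512 for all steps.
We trained two models with the same setup in parallel, using different random seeds. We selected the model with the lowest log likelihood, pile-of-law/legalbert-large-1.7M-1 (which we refer to as PoL-BERT-Large), for experiments, and also release the second model, pile-of-law/legalbert-large-1.7M-2.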
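
For a sense of scale, the figures above imply the following training volume. This is back-of-the-envelope arithmetic that assumes each step processes exactly one batch of 128 full-length sequences (no gradient accumulation); the count includes padding positions and repeated contexts with different masks.

```python
steps = 1_700_000   # training steps
batch_size = 128    # sequences per step
seq_len = 512       # token positions per sequence

sequences_seen = steps * batch_size
token_positions = sequences_seen * seq_len

print(f"{sequences_seen:,} sequences")         # 217,600,000 sequences
print(f"{token_positions:,} token positions")  # 111,411,200,000 token positions
```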

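Evaluation results

When fine-tuned on the CaseHOLD variant provided by the LexGLUE paper, PoL-BERT-Large achieves the results below. The table also reports results for BERT-Large-Uncased and CaseLaw-BERT. We report results for models with hyperparameter tuning on the downstream task, as well as the CaseLaw-BERT result from the LexGLUE paper, which uses a fixed experimental setup.

Model	F1
CaseLaw-BERT (tuned)	78.5
CaseLaw-BERT (LexGLUE)	75.4
PoL-BERT-Large	75.0
BERT-Large-Uncased	71.3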

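🔧 Technical details

Model type

BERT large model (uncased) architecture, pretrained with the RoBERTa objective.

Training data

Roughly 256GB of English legal and administrative text from 35 data sources.

Training steps

1.7 million steps on a SambaNova cluster with 8 RDUs.

Hyperparameters

Learning rate: 5e-6; batch size: 128.

Masking strategy

80-10-10 masking, corruption, leave split, with a replication rate of 20.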
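
The masking strategy can be sketched as follows. Of the positions selected for prediction, 80% are replaced with `[MASK]`, 10% with a random vocabulary token, and 10% are left unchanged; replication then produces 20 independently masked copies of each context. The 15% selection rate is the BERT default and is an assumption here, since the card does not state it; the function names are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[SEP]", "[PAD]", "[CLS]"}  # never selected for prediction


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels); labels are None at unselected positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: mask
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: corrupt with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


def replicate(tokens, vocab, copies=20, seed=0):
    """Create `copies` independently masked versions of one context."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(copies)]
```

Sharing one seeded generator across the copies keeps the sketch reproducible while still giving each copy a different mask.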

Input format

Sentences are filled in up to 256 tokens, a [SEP] token is added, and the total length does not exceed 512 tokens.

📄 License

The Pile of Law dataset is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

BibTeX citation

@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}