🚀 Pile of Law BERT Large 2 (uncased)
This model was pretrained on English legal and administrative text, using the RoBERTa pretraining objective. It uses the same training setup as pile-of-law/legalbert-large-1.7M-1, but with a different random seed.
🚀 Quick Start
The model can be used as-is for masked language modeling or fine-tuned for downstream tasks. Since it was pretrained on a corpus of English legal and administrative text, legal downstream tasks are likely to be a better fit for this model.
✨ Key Features
- Based on the BERT large (uncased) architecture and pretrained on the Pile of Law.
- The Pile of Law contains roughly 256GB of English legal and administrative text, providing ample data for language model pretraining.
- Uses a custom WordPiece vocabulary augmented with legal terms, for a total vocabulary of 32,000 tokens.
📦 Installation
This README does not list specific installation commands; see the Hugging Face documentation for setup instructions.
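As a rough guide (an assumption, not an official requirements list from the model authors), the examples below only need the transformers library plus a deep learning backend:
pip install transformers
# choose a backend depending on the example you want to run
pip install torch        # for the PyTorch snippets
pip install tensorflow   # for the TensorFlow snippet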
💻 Usage Examples
Basic Usage
You can use the model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.5218929052352905,
'token': 4028,
'token_str': 'exception'},
{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.11434809118509293,
'token': 1151,
'token_str': 'appeal'},
{'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.06454459577798843,
'token': 5345,
'token_str': 'exclusion'},
{'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.043593790382146835,
'token': 3677,
'token_str': 'example'},
{'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
'score': 0.03758585825562477,
'token': 3542,
'token_str': 'objection'}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
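The output returned above is a standard Transformers model output. As a small usage sketch (not part of the original card), the token-level features can be read from output.last_hidden_state:
# shape: [batch_size, sequence_length, hidden_size]
features = output.last_hidden_state
# a common, though not prescribed, sentence representation is the mean over tokens
sentence_embedding = features.mean(dim=1)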
And here is how to use it in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
📚 Documentation
Model Description
Pile of Law BERT Large 2 is a transformer model based on the BERT large (uncased) architecture, pretrained on the Pile of Law, a dataset of approximately 256GB of English legal and administrative text for language model pretraining.
Intended Uses & Limitations
You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since the model was pretrained on a corpus of English legal and administrative text, legal downstream tasks are likely to be a better fit.
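As an illustration of such fine-tuning (my own sketch, not the authors' recipe; the binary-classification task, example text, and hyperparameters are placeholders), the checkpoint can be loaded into a standard sequence classification head:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical setup: a binary classification task over legal text.
tokenizer = AutoTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = AutoModelForSequenceClassification.from_pretrained(
    'pile-of-law/legalbert-large-1.7M-2', num_labels=2)

# Tokenize a toy example and run a forward pass; in practice this model and
# tokenizer would be plugged into Trainer or a custom loop with labeled data.
batch = tokenizer(["The contract was terminated for cause."], return_tensors='pt')
logits = model(**batch).logits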
Limitations and Bias
Please refer to Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.
This model can have biased predictions. In the following example, where the model is used with a masked language modeling pipeline to fill in the race descriptor of a criminal, it assigns a higher score to "black" than to "white".
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])
[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.',
'score': 0.02685137465596199,
'token': 4311,
'token_str': 'black'},
{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.',
'score': 0.013632853515446186,
'token': 4249,
'token_str': 'white'}]
This bias will also affect all fine-tuned versions of this model.
Training Data
The Pile of Law BERT large models were pretrained on the Pile of Law, a dataset consisting of approximately 256GB of English legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a CreativeCommons Attribution-NonCommercial-ShareAlike 4.0 International license.
Training Procedure
Preprocessing
The model vocabulary consists of 29,000 tokens from a custom WordPiece vocabulary fit to the Pile of Law with the HuggingFace WordPiece tokenizer, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a vocabulary size of 32,000 tokens. The 80-10-10 masking/corruption/leave split described in BERT is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (which are often falsely mistaken for sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then continuing to fill sentences so that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
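To make the packing rule concrete, here is a minimal sketch (my own illustration of the description above, not the authors' released preprocessing code; the greedy filling strategy and helper names are assumptions), where the sentence list would come from a segmenter such as LexNLP:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

def pack_span(sentences, tokenizer, half_len=256, max_len=512):
    # Greedily fill sentences into [first ~256 tokens][SEP][rest], then pad to 512.
    ids, i, sep_inserted = [], 0, False
    while i < len(sentences):
        tok = tokenizer.encode(sentences[i], add_special_tokens=False)
        limit = half_len if not sep_inserted else max_len
        if len(ids) + len(tok) > limit:
            if not sep_inserted:
                ids.append(tokenizer.sep_token_id)  # close the first half
                sep_inserted = True
                continue  # retry this sentence in the second half
            break  # next sentence is too large for the remaining context
        ids.extend(tok)
        i += 1
    if not sep_inserted:
        ids.append(tokenizer.sep_token_id)
    ids += [tokenizer.pad_token_id] * (max_len - len(ids))  # pad up to 512 tokens
    return ids

span = pack_span(["First sentence of a filing.", "A second, longer sentence."], tokenizer)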
Pretraining
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in the training data. Pretraining used the masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa. The model was pretrained with sequences of length 512 for all steps.
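As a hedged sketch of an equivalent open-source setup (not the SambaNova training code; the toy dataset, masking rate, and several Trainer arguments are assumptions or placeholders), the key choices above map onto the Transformers API roughly as follows:
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertForMaskedLM.from_pretrained('pile-of-law/legalbert-large-1.7M-2')

# MLM objective only (no next-sentence prediction loss), as in RoBERTa.
# The 15% masking rate is the standard BERT default; the card does not state it explicitly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Toy stand-in for packed 512-token Pile of Law spans.
texts = ["The court granted the motion to dismiss."]
train_dataset = [tokenizer(t, truncation=True, padding='max_length', max_length=512) for t in texts]

args = TrainingArguments(
    output_dir='legalbert-mlm',
    per_device_train_batch_size=128,  # overall batch size of 128 (device layout simplified here)
    learning_rate=5e-6,               # small learning rate to mitigate training instability
    max_steps=1_700_000,              # 1.7M steps
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
# trainer.train()  # requires a real dataset and substantial compute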
We trained two models with the same setup in parallel model training runs, using different random seeds. We selected the model with the lowest log-likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for our experiments, but also release the second model, pile-of-law/legalbert-large-1.7M-2.
Evaluation Results
See the model card of pile-of-law/legalbert-large-1.7M-1 for fine-tuning results on the CaseHOLD variant provided by the LexGLUE paper.
Citation Information
@misc{hendersonkrass2022pileoflaw,
url = {https://arxiv.org/abs/2207.00220},
author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
publisher = {arXiv},
year = {2022}
}



