
LegalBERT Large 1.7M-1

Developed by pile-of-law
A BERT-large model pretrained on English legal and administrative texts using RoBERTa pretraining objectives.
Downloads: 120
Release Date: 4/29/2022

Model Overview

This model uses the BERT-large architecture and is pretrained on the Pile of Law dataset, making it well suited to legal natural language processing tasks.
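To make the overview concrete, here is a minimal usage sketch, assuming the Hugging Face transformers library and the published hub id pile-of-law/legalbert-large-1.7M-1 (the sample sentence is illustrative):

```python
from transformers import AutoModel, AutoTokenizer

# Hub id as published by pile-of-law; weights download on first use.
model_id = "pile-of-law/legalbert-large-1.7M-1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sentence and pull contextual embeddings from the encoder.
inputs = tokenizer("The court granted the motion to dismiss.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size; 1024 for BERT-large)
```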

Model Features

Legal Domain Specialization
Pretrained specifically on legal and administrative texts, giving it a stronger grasp of legal terminology
Large-Scale Training Data
Pretrained on approximately 256 GB of English legal and administrative text
Optimized Tokenizer
Uses a custom vocabulary of 32,000 tokens, 3,000 of which are legal terms (a quick check is sketched below)
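The tokenizer claim above can be verified locally; a minimal sketch, assuming the same hub id and the transformers library (the sample phrase is an illustrative assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-1")

# The page states a 32,000-token vocabulary; check it locally.
print(len(tokenizer))  # expected: 32000

# Legal terms should survive as whole word pieces more often than in a
# general-purpose vocabulary.
print(tokenizer.tokenize("The appellant filed a writ of certiorari."))
```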

Model Capabilities

Legal Text Understanding
Masked Language Modeling
Legal Text Classification
Legal Question Answering
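The classification and question-answering capabilities require task-specific fine-tuning: the released checkpoint carries no classification head, so one is attached at load time and must be trained before its outputs mean anything. A minimal sketch, assuming transformers and a hypothetical binary task:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "pile-of-law/legalbert-large-1.7M-1"

# num_labels=2 sets up a hypothetical binary task; the classification head is
# randomly initialized, so fine-tune on labeled data before using predictions.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("This clause limits the seller's liability.",
                   return_tensors="pt", truncation=True)
print(model(**inputs).logits)  # raw scores from the untrained head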

Use Cases

Legal Document Processing
Legal Term Prediction
Predicting specialized terms in legal texts via masked language modeling
For example, correctly predicting 'appeal' as the most likely fill-in word (a runnable sketch follows this list)
Legal Document Analysis
Analyzing legal document content
Legal Research Assistance
Case Retrieval Enhancement
Improving legal case retrieval systems
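A minimal sketch of the term-prediction use case above, using the fill-mask pipeline; the example sentence is an assumption chosen so that 'appeal' is a natural fill-in:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="pile-of-law/legalbert-large-1.7M-1")

# Illustrative sentence (an assumption): a masked slot a lawyer would
# complete with 'appeal'.
sentence = ("An [MASK] is a request made after a trial, asking another court "
            "to decide whether the trial was conducted properly.")

for pred in fill_mask(sentence, top_k=5):
    print(f"{pred['token_str']:>12}  score={pred['score']:.4f}")
```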