Legalbert Open-source Legal Pre-trained Model - Optimized for the Characteristics of Legal Texts

Developed by casehold

BERT-based pre-trained model specialized for legal text, with tokenization and sentence segmentation adapted to it

Large Language Model · English · #Legal text pre-training #Judicial decision analysis #CaseHOLD dataset

Release Time: 3/2/2022

Model Overview

This model is a BERT variant further pre-trained on large-scale legal judgment texts, specifically designed for natural language processing tasks in the legal domain, such as legal text classification and case analysis.

Model Features

Legal domain specialization

Further pre-trained on 37GB of legal judgment texts, optimized for legal terminology and text structure

Large-scale training data

Training corpus includes 3,446,187 legal decisions (37GB), larger than the BookCorpus/Wikipedia corpus (15GB) originally used to train BERT

Multi-task support

Supports masked language modeling, next sentence prediction, and legal-specific tasks like CaseHOLD multiple-choice questions

Model Capabilities

Legal text understanding

Legal text classification

Legal multiple-choice question answering

Masked-token prediction for legal text

Legal semantic analysis

Use Cases

Legal text analysis

Overruling detection

Analyze legal judgment texts to determine whether they overrule earlier precedents

Terms of Service classification

Automatically classify legal contracts and terms of service

Legal education

CaseHOLD multiple-choice question answering

Assist in answering case-based multiple-choice questions in legal education

🚀 Legal-BERT

This project provides the model and tokenizer files for the Legal-BERT model. It supports legal text analysis and classification tasks through self-supervised pretraining on a legal-domain corpus.

✨ Features

Specialized for Law: Trained on a large-scale legal corpus, making it well suited to legal text processing.
Based on BERT: Initialized from the base BERT model and further pretrained on legal text.

📚 Documentation

Model Source

Model and tokenizer files for the Legal-BERT model from "When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings".

Training Data

Property	Details
Model Type	Legal-BERT
Training Data	The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present (https://case.law/). The size of this corpus (37GB) is substantial, representing 3,446,187 legal decisions across all federal and state courts, and is larger than the size of the BookCorpus/Wikipedia corpus originally used to train BERT (15GB).
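
As a rough sanity check on the figures above, the short Python calculation below derives the average size of a decision and the size ratio to BERT's original corpus. It assumes that 1 GB means 10^9 bytes, which the source does not specify, and the variable names are illustrative only.

```python
# Figures quoted in the Training Data table.
corpus_gb = 37               # Harvard Law case corpus, 1965 to present
num_decisions = 3_446_187    # legal decisions in the corpus
bert_corpus_gb = 15          # BookCorpus/Wikipedia corpus used for BERT

# Assumption: 1 GB = 10**9 bytes (the source does not say which unit is meant).
bytes_per_decision = corpus_gb * 10**9 / num_decisions
size_ratio = corpus_gb / bert_corpus_gb

print(f"average decision size: {bytes_per_decision / 1000:.1f} KB")
print(f"legal corpus / BERT corpus: {size_ratio:.2f}x")
```

Under that assumption the average decision is on the order of 10 KB of text, and the legal corpus is a little under 2.5 times the size of the original BERT corpus.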

Training Objective

This model is initialized with the base BERT model (uncased, 110M parameters), [bert-base-uncased](https://huggingface.co/bert-base-uncased), and trained for an additional 1M steps on the MLM and NSP objectives, with tokenization and sentence segmentation adapted for legal text (cf. the paper).
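
To make the MLM objective more concrete, here is a minimal sketch of BERT-style token masking in plain Python. It follows the recipe from the original BERT paper (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). It is not the authors' training code: the function name `mask_tokens`, the whitespace tokenization, and the toy vocabulary are assumptions for illustration, and each token is selected independently rather than sampling an exact 15% of positions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for a masked-language-modeling example.

    labels[i] is the original token at positions chosen for prediction,
    and None everywhere else (those positions do not contribute to the loss).
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Toy example: whitespace tokens stand in for real WordPiece tokens.
tokens = "the court held that the contract was unenforceable".split()
masked, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(masked)
print(labels)
```

During pretraining, the model sees `masked` as input and is trained to predict the non-`None` entries of `labels`; the NSP objective is a separate sentence-pair classification task and is not shown here.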

Usage

Please see the casehold repository for scripts that support computing pretraining loss and fine-tuning Legal-BERT on the classification and multiple-choice tasks described in the paper: Overruling, Terms of Service, and CaseHOLD.
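
For orientation only, the following sketch shows one way the model could be loaded for a CaseHOLD-style multiple-choice task with the Hugging Face `transformers` library. It is not taken from the casehold repository. The model identifier `casehold/legalbert` and the helper names `build_choice_pairs` and `score_holdings` are assumptions, and the multiple-choice head loaded here is newly initialized, so the scores are meaningless until the model has been fine-tuned.

```python
def build_choice_pairs(context, holdings):
    """Pair a single citing context with each candidate holding."""
    return [(context, holding) for holding in holdings]

def score_holdings(context, holdings, model_id="casehold/legalbert"):
    """Return one raw score per candidate holding.

    Assumption: `model_id` points at the Legal-BERT checkpoint on the
    Hugging Face Hub. Requires `torch` and `transformers` to be installed.
    """
    import torch
    from transformers import AutoModelForMultipleChoice, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMultipleChoice.from_pretrained(model_id)
    model.eval()

    pairs = build_choice_pairs(context, holdings)
    encoded = tokenizer(
        [ctx for ctx, _ in pairs],
        [holding for _, holding in pairs],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    # Multiple-choice heads expect tensors shaped (batch, num_choices, seq_len).
    inputs = {name: tensor.unsqueeze(0) for name, tensor in encoded.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_choices)
    return logits.squeeze(0).tolist()
```

Calling `score_holdings(context, holdings)` downloads the checkpoint on first use; after fine-tuning, the index of the highest score would be taken as the predicted holding.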

Citation

@inproceedings{zhengguha2021,
    title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
    author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
    year={2021},
    eprint={2104.08671},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
    publisher={Association for Computing Machinery}
}

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL].