Custom Legal-BERT
Model and tokenizer files for Custom Legal-BERT, a specialized model for legal text analysis.
Quick Start
This README provides details about the Custom Legal-BERT model, including its training data, objectives, usage, and citation information.
Features
- Domain-Specific: Tailored for legal text with a custom legal vocabulary.
- Large Training Corpus: Pretrained on a substantial Harvard Law case corpus.
- MLM and NSP Objectives: Pretrained from scratch using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
Installation
No specific installation steps are provided.
Usage Examples
Please see the casehold repository for scripts that support computing pretraining loss and fine-tuning Custom Legal-BERT on the classification and multiple-choice tasks described in the paper: Overruling, Terms of Service, and CaseHOLD.
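For a quick sanity check outside those scripts, the sketch below shows one way the model and tokenizer might be loaded with Hugging Face Transformers to compute the MLM (pretraining) loss at a single masked position. The model identifier `casehold/custom-legalbert`, the example sentence, and the masked position are assumptions for illustration, not part of the casehold scripts.

```python
# Hypothetical sketch: compute MLM loss at one masked position with Hugging Face Transformers.
# The identifier "casehold/custom-legalbert" is an assumption; point from_pretrained at a local
# directory containing the model and tokenizer files if the Hub name differs.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "casehold/custom-legalbert"  # assumed identifier; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "The court granted the defendant's motion to dismiss."
inputs = tokenizer(text, return_tensors="pt")
labels = inputs["input_ids"].clone()

# Mask one token and score only that position; label -100 is ignored by the loss.
mask_pos = 2  # illustrative position (after [CLS])
inputs["input_ids"][0, mask_pos] = tokenizer.mask_token_id
masked_labels = torch.full_like(labels, -100)
masked_labels[0, mask_pos] = labels[0, mask_pos]

with torch.no_grad():
    outputs = model(**inputs, labels=masked_labels)
print(f"MLM loss at the masked position: {outputs.loss.item():.4f}")
```

The same `from_pretrained` calls also accept a local directory, so the downloaded model and tokenizer files can be used directly without a Hub identifier.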
Documentation
Model Details
The Custom Legal-BERT model and tokenizer files are sourced from the paper "When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset" (Zheng et al., 2021).
Training Data
The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present (https://case.law/). At 37GB, this corpus is substantial, comprising 3,446,187 legal decisions across all federal and state courts, and is larger than the 15GB BookCorpus/Wikipedia corpus originally used to train BERT.
Training Objective
This model is pretrained from scratch for 2M steps on the MLM and NSP objectives, with tokenization and sentence segmentation adapted for legal text (cf. the paper).
The model also uses a custom domain-specific legal vocabulary. The vocabulary set is constructed using SentencePiece on a subsample (approx. 13M) of sentences from our pretraining corpus, with the number of tokens fixed to 32,000.
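For illustration, a vocabulary of this kind could be built with the sentencepiece Python package roughly as follows; the input file name and trainer options shown here are assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of building a 32,000-token legal vocabulary with SentencePiece.
# "legal_sentences.txt" (one sentence per line) and the trainer options are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="legal_sentences.txt",   # subsample of sentences from the pretraining corpus
    model_prefix="legal_vocab",    # writes legal_vocab.model and legal_vocab.vocab
    vocab_size=32000,              # fixed vocabulary size, as in the paper
    model_type="unigram",          # SentencePiece default; the paper does not specify
)

sp = spm.SentencePieceProcessor(model_file="legal_vocab.model")
print(sp.encode("The appellate court reversed the judgment.", out_type=str))
```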
Technical Details
The model is pretrained using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives for 2M steps. Tokenization and sentence segmentation are adapted for legal text. A custom legal vocabulary of 32,000 tokens is constructed using SentencePiece on a subsample of the pretraining corpus.
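As a hedged sketch of the fine-tuning side (e.g., a binary classification task such as Overruling), the snippet below adds a classification head on top of the pretrained encoder and runs a single training step; the texts, labels, and hyperparameters are placeholders rather than the casehold configuration.

```python
# Hypothetical sketch: fine-tune the encoder for a binary classification task.
# Dataset and hyperparameters are placeholders, not the casehold repository settings.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "casehold/custom-legalbert"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["We overrule Smith v. Jones.", "The facts of this case are undisputed."]
labels = torch.tensor([1, 0])  # illustrative labels
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # single illustrative training step
outputs.loss.backward()
optimizer.step()
print(f"classification loss: {outputs.loss.item():.4f}")
```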
License
No license information is provided.
Citation
@inproceedings{zhengguha2021,
    title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
    author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
    year={2021},
    eprint={2104.08671},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
    publisher={Association for Computing Machinery}
}
Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL].