🚀 LexLM large
LexLM large is a model further pre-trained from RoBERTa large on the LeXFiles corpus, designed for legal language processing.
🚀 Quick Start
The model can be used for fill-mask tasks: provide legal text containing a `<mask>` token and let the model predict the masked word (a minimal usage sketch follows the examples). Here are some sample texts:
- "The applicant submitted that her husband was subjected to treatment amounting to whilst in the custody of police."
- "This Agreement is between General Motors and John Murray."
- "Establishing a system for the identification and registration of animals and regarding the labelling of beef and beef products."
- "Because the Court granted before judgment, the Court effectively stands in the shoes of the Court of Appeals and reviews the defendants’ appeals."
✨ Features
- Pre-training on a Legal Corpus: This model was further pre-trained from RoBERTa large (https://huggingface.co/roberta-large) on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lex_files), which makes it more suitable for legal language tasks.
- Following Best Practices: LexLM (Base/Large) follows a series of best practices in language model development:
- Warm-start from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
- Train a new tokenizer of 50k BPEs, and reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021); see the embedding-reuse sketch after this list.
- Continue pre-training on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate (Wettig et al., 2022) for base/large models, respectively.
- Use a sentence sampler with exponential smoothing of the sub-corpora sampling rates, following Conneau et al. (2019), to preserve per-corpus capacity.
- Consider mixed-cased models, similar to all recently developed large PLMs.
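As an illustration of the tokenizer/embedding-reuse step, the sketch below copies RoBERTa's input embeddings for tokens that also appear in the new tokenizer's vocabulary. The path to the new 50k BPE tokenizer is hypothetical, and the copy logic is a simplification of the approach in Pfeiffer et al. (2021), not the actual training code.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("roberta-large")
# Hypothetical path to a newly trained 50k BPE tokenizer.
new_tok = AutoTokenizer.from_pretrained("path/to/new-legal-bpe-tokenizer")

model = AutoModelForMaskedLM.from_pretrained("roberta-large")
old_embeddings = model.get_input_embeddings().weight.data.clone()
old_vocab = old_tok.get_vocab()

# Resize to the new vocabulary, then copy vectors for lexically overlapping tokens.
model.resize_token_embeddings(len(new_tok))
new_embeddings = model.get_input_embeddings().weight.data
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_embeddings[new_id] = old_embeddings[old_id]
```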
📦 Installation
No specific installation steps are provided; the model can be loaded directly with the Hugging Face Transformers library, as in the Quick Start sketch above.
📚 Documentation
Model description
LexLM (Base/Large) are newly released RoBERTa models. The development process adheres to a series of best practices in language model development:
- We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
- We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
- We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples, and a 20%/30% masking rate (Wettig et al., 2022) for base/large models, respectively.
- We use a sentence sampler with exponential smoothing of the sub-corpora sampling rates, following Conneau et al. (2019), since the proportion of tokens varies widely across sub-corpora and we aim to preserve per-corpus capacity (avoid overfitting); a minimal sketch of this smoothing follows the list.
- We consider mixed-cased models, similar to all recently developed large PLMs.
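Below is a minimal sketch of exponentially smoothed sampling rates in the spirit of Conneau et al. (2019). The token counts and the smoothing exponent (alpha = 0.5) are illustrative assumptions, not the values used for LexLM.

```python
import numpy as np

# Illustrative per-sub-corpus token counts (not the real LeXFiles statistics).
token_counts = np.array([10e9, 5e9, 1e9, 0.5e9, 0.2e9])

def smoothed_rates(counts, alpha=0.5):
    """Exponentially smooth sub-corpus sampling rates: q_i proportional to p_i ** alpha."""
    p = counts / counts.sum()  # raw token proportions
    q = p ** alpha             # alpha < 1 up-weights smaller sub-corpora
    return q / q.sum()

print(smoothed_rates(token_counts))
```

With alpha < 1, small sub-corpora are sampled more often than their raw token share, which is what preserving per-corpus capacity refers to above.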
Intended uses & limitations
More information needed
Training and evaluation data
The model was trained on the LeXFiles corpus (https://huggingface.co/datasets/lexlms/lex_files). For evaluation results, please see our work "LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development" (Chalkidis et al., 2023).
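The corpus can be inspected with 🤗 Datasets; a hedged sketch follows. The configuration name "eu_legislation" is an assumption for illustration, so check the dataset card for the actual configurations and column names.

```python
from datasets import load_dataset

# "eu_legislation" is an assumed sub-corpus configuration; see the dataset card
# at https://huggingface.co/datasets/lexlms/lex_files for the available ones.
dataset = load_dataset("lexlms/lex_files", name="eu_legislation", split="train", streaming=True)

# Stream a few records without downloading the full corpus.
for i, example in enumerate(dataset):
    print(example)
    if i == 2:
        break
```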
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- training_steps: 1000000
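For readers reconstructing the effective batch sizes, the per-device values above combine as in the following sketch; this is our reading of the listed hyperparameters, not an excerpt from the training code.

```python
# Per-device settings taken from the list above.
train_batch_size = 8
eval_batch_size = 8
num_devices = 8
gradient_accumulation_steps = 4

# Effective batch sizes: 8 * 8 * 4 = 256 for training, 8 * 8 = 64 for evaluation.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
assert (total_train_batch_size, total_eval_batch_size) == (256, 64)
```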
Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:-------:|:---------------:|
| 1.1322 | 0.05 | 50000 | 0.8690 |
| 1.0137 | 0.1 | 100000 | 0.8053 |
| 1.0225 | 0.15 | 150000 | 0.7951 |
| 0.9912 | 0.2 | 200000 | 0.7786 |
| 0.976 | 0.25 | 250000 | 0.7648 |
| 0.9594 | 0.3 | 300000 | 0.7550 |
| 0.9525 | 0.35 | 350000 | 0.7482 |
| 0.9152 | 0.4 | 400000 | 0.7343 |
| 0.8944 | 0.45 | 450000 | 0.7245 |
| 0.893 | 0.5 | 500000 | 0.7216 |
| 0.8997 | 1.02 | 550000 | 0.6843 |
| 0.8517 | 1.07 | 600000 | 0.6687 |
| 0.8544 | 1.12 | 650000 | 0.6624 |
| 0.8535 | 1.17 | 700000 | 0.6565 |
| 0.8064 | 1.22 | 750000 | 0.6523 |
| 0.7953 | 1.27 | 800000 | 0.6462 |
| 0.8051 | 1.32 | 850000 | 0.6386 |
| 0.8148 | 1.37 | 900000 | 0.6383 |
| 0.8004 | 1.42 | 950000 | 0.6408 |
| 0.8031 | 1.47 | 1000000 | 0.6314 |
Framework versions
- Transformers 4.20.0
- Pytorch 1.12.0+cu102
- Datasets 2.7.0
- Tokenizers 0.12.0
Citation
Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard.
LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.
2023. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
author = "Chalkidis*, Ilias and
Garneau*, Nicolas and
Goanta, Catalina and
Katz, Daniel Martin and
Søgaard, Anders",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2305.07507",
}
📄 License
This model is licensed under cc-by-sa-4.0.
📋 Information Table
| Property | Details |
|----------|---------|
| Model Type | Fill-mask |
| Training Data | lexlms/lex_files |
| License | cc-by-sa-4.0 |