LEGAL-BERT: The Muppets straight out of Law School
LEGAL-BERT is a family of BERT models designed for the legal domain, aiming to support legal NLP research, computational law, and legal technology applications.
Quick Start
To load the pre-trained LEGAL-BERT model, you can use the following code:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-uncased-echr")
model = AutoModel.from_pretrained("nlpaueb/bert-base-uncased-echr")
```
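Once loaded, the model can encode text. The following is a minimal sketch, not the authors' recommended usage: it assumes PyTorch is installed alongside `transformers`, the helper name `embed_texts` is hypothetical, and taking the `[CLS]` vector as a text representation is a common convention that usually benefits from task-specific fine-tuning.

```python
def embed_texts(texts, model_path="nlpaueb/bert-base-uncased-echr", max_length=512):
    """Return one [CLS] vector per input text as a torch tensor.

    `embed_texts` is an illustrative helper, not part of LEGAL-BERT itself.
    """
    # Imported lazily so that defining the function does not require the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path)
    model.eval()  # disable dropout for inference

    # Pad to the longest text in the batch and truncate anything beyond max_length.
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # Position 0 of the last hidden layer holds the [CLS] token.
    return outputs.last_hidden_state[:, 0]


# Example call (downloads the model on first use):
# vectors = embed_texts(["The applicant complained under Article 3 of the Convention."])
```

For a BERT-BASE model the returned tensor has one row per input text and 768 columns, matching the hidden size.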
Features
- LEGAL-BERT is a collection of BERT models tailored for the legal domain, assisting in legal NLP research, computational law, and legal technology applications.
- Different sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and the general LEGAL-BERT perform better than vanilla BERT for domain-specific tasks.
- This specific variant is pre-trained on ECHR cases.
Documentation
Pre-training corpora
The pre-training corpora of LEGAL-BERT include:
- 116,062 documents of EU legislation, publicly available from EURLEX (http://eur-lex.europa.eu), the repository of EU Law under the EU Publication Office.
- 61,826 documents of UK legislation, publicly available from the UK legislation portal (http://www.legislation.gov.uk).
- 19,867 cases from the European Court of Justice (ECJ), also available from EURLEX.
- 12,554 cases from HUDOC, the repository of the European Court of Human Rights (ECHR) (http://hudoc.echr.coe.int/eng).
- 164,141 cases from various courts across the USA, hosted in the Case Law Access Project portal (https://case.law).
- 76,366 US contracts from EDGAR, the database of US Securities and Exchange Commission (SECOM) (https://www.sec.gov/edgar.shtml).
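To get a sense of how the corpora listed above compare in size, the short sketch below sums the document counts and prints each corpus's share. The counts are copied from the list; the dictionary labels and the helper name `corpus_shares` are illustrative. The shares are by number of documents only and say nothing about the amount of text in each corpus.

```python
# Document counts copied from the corpus list above.
CORPUS_SIZES = {
    "EU legislation (EURLEX)": 116_062,
    "UK legislation": 61_826,
    "ECJ cases": 19_867,
    "ECHR cases (HUDOC)": 12_554,
    "US court cases": 164_141,
    "US contracts (EDGAR)": 76_366,
}


def corpus_shares(sizes):
    """Return each corpus's fraction of the total document count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_documents = sum(CORPUS_SIZES.values())
shares = corpus_shares(CORPUS_SIZES)

print(f"Total documents: {total_documents:,}")
# List the corpora from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {share:6.1%}")
```

By this count the ECHR cases make up just under 3% of the 450,816 documents.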
Pre-training details
- We trained BERT using the official code from Google BERT's GitHub repository (https://github.com/google-research/bert).
- We released a model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters).
- We followed the same training setup: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
- We used a single Google Cloud TPU v3-8 provided for free from TensorFlow Research Cloud (TFRC) and GCP research credits.
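The training setup above implies a rough token budget, which the arithmetic below makes explicit. This is a back-of-the-envelope sketch: it assumes every sequence is filled to the full 512 tokens, so the result is an upper bound and not a reported figure.

```python
# Values taken from the training setup described above.
TRAINING_STEPS = 1_000_000
BATCH_SIZE = 256
SEQUENCE_LENGTH = 512

# Each step processes one batch of sequences.
sequences_seen = TRAINING_STEPS * BATCH_SIZE

# Upper bound: assumes no sequence is shorter than SEQUENCE_LENGTH.
max_tokens_seen = sequences_seen * SEQUENCE_LENGTH

print(f"Sequences processed: {sequences_seen:,}")
print(f"Tokens processed (upper bound): {max_tokens_seen:,}")
```

That is 256 million sequences and at most about 131 billion token positions over the whole run.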
Models list
Property | Details |
---|---|
Model Type | There are multiple variants of LEGAL-BERT, including CONTRACTS-BERT-BASE, EURLEX-BERT-BASE, ECHR-BERT-BASE, LEGAL-BERT-BASE, and LEGAL-BERT-SMALL. |
Training Data | Different models are trained on different corpora, such as US contracts, EU legislation, ECHR cases, or all of the above. |
Model name | Model Path | Training corpora |
---|---|---|
CONTRACTS-BERT-BASE | nlpaueb/bert-base-uncased-contracts | US contracts |
EURLEX-BERT-BASE | nlpaueb/bert-base-uncased-eurlex | EU legislation |
ECHR-BERT-BASE | nlpaueb/bert-base-uncased-echr | ECHR cases |
LEGAL-BERT-BASE * | nlpaueb/legal-bert-base-uncased | All |
LEGAL-BERT-SMALL | nlpaueb/legal-bert-small-uncased | All |
* LEGAL-BERT-BASE is the model referred to as LEGAL-BERT-SC in Chalkidis et al. (2020): a model trained from scratch on the legal corpora listed above, using a new vocabulary created by a sentence-piece tokenizer trained on the same corpora.
** The LEGAL-BERT-FP models have been released on Archive.org (https://archive.org/details/legal_bert_fp).
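When switching between variants, it can help to keep the table above as a lookup. The sketch below is one possible way to do that: the model paths come from the table, while `LEGAL_BERT_MODELS`, `model_path`, and `load_variant` are hypothetical names introduced here for illustration.

```python
# Model names and Hugging Face paths, copied from the table above.
LEGAL_BERT_MODELS = {
    "CONTRACTS-BERT-BASE": "nlpaueb/bert-base-uncased-contracts",
    "EURLEX-BERT-BASE": "nlpaueb/bert-base-uncased-eurlex",
    "ECHR-BERT-BASE": "nlpaueb/bert-base-uncased-echr",
    "LEGAL-BERT-BASE": "nlpaueb/legal-bert-base-uncased",
    "LEGAL-BERT-SMALL": "nlpaueb/legal-bert-small-uncased",
}


def model_path(name):
    """Map a variant name (case-insensitive) to its model path."""
    key = name.strip().upper()
    if key not in LEGAL_BERT_MODELS:
        known = ", ".join(sorted(LEGAL_BERT_MODELS))
        raise ValueError(f"Unknown variant {name!r}; expected one of: {known}")
    return LEGAL_BERT_MODELS[key]


def load_variant(name):
    """Load the tokenizer and model for a variant (downloads on first use)."""
    # Imported lazily so the lookup above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    path = model_path(name)
    return AutoTokenizer.from_pretrained(path), AutoModel.from_pretrained(path)


print(model_path("echr-bert-base"))
```

Calling `load_variant("ECHR-BERT-BASE")` would then load the same model as the Quick Start snippet.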
Use LEGAL-BERT variants as Language Models
Model / Corpus | Input text | Masked token | Predictions |
---|---|---|---|
BERT-BASE-UNCASED | |||
(Contracts) | This [MASK] Agreement is between General Motors and John Murray. | employment | ('new', '0.09'), ('current', '0.04'), ('proposed', '0.03'), ('marketing', '0.03'), ('joint', '0.02') |
(ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.32'), ('rape', '0.22'), ('abuse', '0.14'), ('death', '0.04'), ('violence', '0.03') |
(EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products. | bovine | ('farm', '0.25'), ('livestock', '0.08'), ('draft', '0.06'), ('domestic', '0.05'), ('wild', '0.05') |
CONTRACTS-BERT-BASE | |||
(Contracts) | This [MASK] Agreement is between General Motors and John Murray. | employment | ('letter', '0.38'), ('dealer', '0.04'), ('employment', '0.03'), ('award', '0.03'), ('contribution', '0.02') |
(ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('death', '0.39'), ('imprisonment', '0.07'), ('contempt', '0.05'), ('being', '0.03'), ('crime', '0.02') |
(EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products. | bovine | ('domestic', '0.18'), ('laboratory', '0.07'), ('household', '0.06'), ('personal', '0.06'), ('the', '0.04') |
EURLEX-BERT-BASE | |||
(Contracts) | This [MASK] Agreement is between General Motors and John Murray. | employment | ('supply', '0.11'), ('cooperation', '0.08'), ('service', '0.07'), ('licence', '0.07'), ('distribution', '0.05') |
(ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.66'), ('death', '0.07'), ('imprisonment', '0.07'), ('murder', '0.04'), ('rape', '0.02') |
(EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products. | bovine | ('live', '0.43'), ('pet', '0.28'), ('certain', '0.05'), ('fur', '0.03'), ('the', '0.02') |
ECHR-BERT-BASE | |||
(Contracts) | This [MASK] Agreement is between General Motors and John Murray. | employment | ('second', '0.24'), ('latter', '0.10'), ('draft', '0.05'), ('bilateral', '0.05'), ('arbitration', '0.04') |
(ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.99'), ('death', '0.01'), ('inhuman', '0.00'), ('beating', '0.00'), ('rape', '0.00') |
(EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products. | bovine | ('pet', '0.17'), ('all', '0.12'), ('slaughtered', '0.10'), ('domestic', '0.07'), ('individual', '0.05') |
LEGAL-BERT-BASE | |||
(Contracts) | This [MASK] Agreement is between General Motors and John Murray. | employment | ('settlement', '0.26'), ('letter', '0.23'), ('dealer', '0.04'), ('master', '0.02'), ('supplemental', '0.02') |
(ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '1.00'), ('detention', '0.00'), ('arrest', '0.00'), ('rape', '0.00'), ('death', '0.00') |
(EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products. | bovine | ('live', '0.67'), ('beef', '0.17'), ('farm', '0.03'), ('pet', '0.02'), ('dairy', '0.01') |
LEGAL-BERT-SMALL | |||
(Contracts) | This [MASK] Agreement is between General Motors and John Murray. | employment | ('license', '0.09'), ('transition', '0.08'), ('settlement', '0.04'), ('consent', '0.03'), ('letter', '0.03') |
(ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.59'), ('pain', '0.05'), ('ptsd', '0.05'), ('death', '0.02'), ('tuberculosis', '0.02') |
(EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products. | bovine | ('all', '0.08'), ('live', '0.07'), ('certain', '0.07'), ('the', '0.07'), ('farm', '0.05') |
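Predictions like those in the table above can be obtained with the `fill-mask` pipeline from `transformers`. The sketch below is an illustration under stated assumptions, not the script used to produce the table: the helper names `top_predictions` and `predict_masked_token` are hypothetical, and scores may differ slightly across library versions.

```python
def top_predictions(fill_mask_output, k=5):
    """Turn fill-mask pipeline output into (token, 'score') pairs as in the table.

    Each item of `fill_mask_output` is a dict with at least the keys
    'token_str' and 'score', which is what the pipeline returns for an
    input containing a single [MASK].
    """
    ranked = sorted(fill_mask_output, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), f"{p['score']:.2f}") for p in ranked[:k]]


def predict_masked_token(text, model_path="nlpaueb/bert-base-uncased-echr", k=5):
    """Run a masked-token prediction (downloads the model on first use)."""
    # Imported lazily so top_predictions can be used without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_path)
    return top_predictions(fill_mask(text, top_k=k), k)


# Example call with the ECHR sentence from the table:
# predict_masked_token(
#     "The applicant submitted that her husband was subjected to treatment "
#     "amounting to [MASK] whilst in the custody of Adana Security Directorate"
# )
```

Passing a different `model_path` from the models list runs the same sentence through another variant for comparison.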
Evaluation on downstream tasks
Refer to the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School" by Chalkidis et al., 2020 (https://aclanthology.org/2020.findings-emnlp.261).
Author - Publication
@inproceedings{chalkidis-etal-2020-legal,
title = "{LEGAL}-{BERT}: The Muppets straight out of Law School",
author = "Chalkidis, Ilias and
Fergadiotis, Manos and
Malakasiotis, Prodromos and
Aletras, Nikolaos and
Androutsopoulos, Ion",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
doi = "10.18653/v1/2020.findings-emnlp.261",
pages = "2898--2904"
}
About Us
AUEB's Natural Language Processing Group develops algorithms, models, and systems for natural language processing and generation.
The group's research interests include:
- Question answering systems for various data sources, especially biomedical question answering.
- Natural language generation from databases and ontologies, especially Semantic Web ontologies.
- Text classification, including spam and abusive content filtering.
- Information extraction and opinion mining, including legal text analytics and sentiment analysis.
- Natural language processing tools for Greek, such as parsers and named-entity recognizers.
- Machine learning in natural language processing, especially deep learning.
The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.
Ilias Chalkidis on behalf of AUEB's Natural Language Processing Group
| Github: @ilias.chalkidis | Twitter: @KiddoThe2B |
License
This project is licensed under the CC-BY-SA-4.0 license.

