🚀 (BERT base) Language modeling in the legal domain in Portuguese
legal-bert-base-cased-ptbr is a Language Model for the legal domain in Portuguese. It is built upon the BERTimbau base model and trained with a masked language modeling (MLM) objective. The model aims to support NLP research in the legal field, as well as computer-law and legal-technology applications. A variety of Portuguese legal texts were used for training (more details below).
The large version of the model will be available soon.
✨ Features
- Domain-Specific: Tailored for the legal domain in Portuguese.
- Research-Oriented: Supports NLP research in law-related fields.
- Versatile Application: Applicable in computer law and legal technology.
📦 Installation
Installation simply involves loading the pre-trained model with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
model = AutoModel.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
```

Alternatively, you can use the model directly through a `fill-mask` pipeline:

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="dominguesm/legal-bert-base-cased-ptbr")
```
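For reference, a `fill-mask` pipeline returns a list of dicts with `token_str` and `score` keys. A minimal sketch of picking the top candidate from such output (the values here are copied from the example table below, not produced by a live model call):

```python
# Illustrative fill-mask output in the standard transformers format:
# a list of dicts with "token_str" and "score" keys. The values are
# taken from the prediction table below, not from a live model call.
predictions = [
    {"token_str": "Civil", "score": 0.9999},
    {"token_str": "civil", "score": 0.0001},
    {"token_str": "Penal", "score": 0.0000},
]

def top_prediction(preds):
    """Return the candidate token with the highest score."""
    return max(preds, key=lambda p: p["score"])["token_str"]

print(top_prediction(predictions))  # Civil
```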
📚 Documentation
Pre-training corpora
The pre-training corpora of legal-bert-base-cased-ptbr consist of:
- 61,309 miscellaneous legal documents
- 751 petitions
- 682 sentences
- 498 2nd Instance Accords
- 469 RE grievances
- 411 Admissibility Orders
The data was provided by the BRAZILIAN SUPREME FEDERAL TRIBUNAL under the terms of use: LREC 2020. The results of this project do not represent the position of the BRAZILIAN SUPREME FEDERAL TRIBUNAL, and all responsibilities lie solely with the model's author.
Use legal-bert-base-cased-ptbr variants as Language Models
| Property | Details |
| --- | --- |
| Model Type | Fill-Mask Language Model |
| Training Data | Miscellaneous legal documents, petitions, sentences, 2nd Instance Accords, RE grievances, and Admissibility Orders from the BRAZILIAN SUPREME FEDERAL TRIBUNAL |
| Text | Masked token | Predictions |
| --- | --- | --- |
| De ordem, a Secretaria Judiciária do Supremo Tribunal Federal INTIMA a parte abaixo identificada, ou quem as suas vezes fizer, do inteiro teor do(a) despacho/decisão presente nos autos (art. 270 do Código de Processo [MASK] e art 5º da Lei 11.419/2006). | Civil | ('Civil', 0.9999), ('civil', 0.0001), ('Penal', 0.0000), ('eletrônico', 0.0000), ('2015', 0.0000) |
| 2. INTIMAÇÃO da Autarquia: 2.2 Para que apresente em Juízo, com a contestação, cópia do processo administrativo referente ao benefício [MASK] em discussão na lide | previdenciário | ('ora', 0.9424), ('administrativo', 0.0202), ('doença', 0.0117), ('acidente', 0.0037), ('posto', 0.0036) |
| Certifico que, nesta data, os presentes autos foram remetidos ao [MASK] para processar e julgar recurso (Agravo de Instrumento). | STF | ('Tribunal', 0.4278), ('Supremo', 0.1657), ('origem', 0.1538), ('arquivo', 0.1415), ('sistema', 0.0216) |
| TEMA: 810. Validade da correção monetária e dos juros moratórios [MASK] sobre as condenações impostas à Fazenda Pública, conforme previstos no art. 1º-F da Lei 9.494/1997, com a redação dada pela Lei 11.960/2009. | incidentes | ('incidentes', 0.9979), ('incidente', 0.0021), ('aplicados', 0.0000), (',', 0.0000), ('aplicada', 0.0000) |
Training results
```
Num examples = 353435
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 33135
```
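The reported step count is consistent with these settings: with a total batch size of 32, one epoch over 353,435 examples takes ⌈353435 / 32⌉ = 11,045 optimization steps, and three epochs give 33,135. A quick check:

```python
import math

num_examples = 353_435   # Num examples
total_batch_size = 32    # Total train batch size
epochs = 3               # Num Epochs

# The final partial batch of each epoch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 33135
```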
TRAIN RESULTS

```
"epoch": 3.0
"train_loss": 0.6107781137512769
"train_runtime": 10192.1545
"train_samples": 353435
"train_samples_per_second": 104.031
"train_steps_per_second": 3.251
```
EVAL RESULTS

```
"epoch": 3.0
"eval_loss": 0.47251805663108826
"eval_runtime": 126.3026
"eval_samples": 17878
"eval_samples_per_second": 141.549
"eval_steps_per_second": 4.426
"perplexity": 1.604028145934512
```
Citation
```bibtex
@misc{domingues2022legal-bert-base-cased-ptbr,
  author = {Domingues, Maicon},
  title = {Language Model in the legal domain in Portuguese},
  year = {2022},
  howpublished = {\url{https://huggingface.co/dominguesm/legal-bert-base-cased-ptbr/}}
}
```
📄 License
This model is licensed under the CC BY 4.0 (cc-by-4.0) license.