Model Overview

This is a Portuguese language model based on the RoBERTa architecture and specialized for legal text processing. It supports Portuguese, including the Brazilian and European variants.

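Model Features

Legal-domain optimization

Pretrained and optimized specifically for Portuguese legal text

Diverse training data

Trained on a combination of legal-domain (LegalPT) and general-domain (CrawlPT) data

High performance

Achieves the best average F1 among the models compared on the PortuLex benchmark for Portuguese legal NLP

Data deduplication

Training data was deduplicated with the MinHash algorithm before training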

Model Capabilities

Portuguese text understanding

Legal text analysis

Named entity recognition

Token classification

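Use Cases

Legal text processing

Legal document analysis

Analyzes key information in legal documents

Achieves an average F1 score of 85.41% on the PortuLex benchmark

Legal entity recognition

Identifies specific entities in legal text

Achieves an F1 score of 90.73% on the LeNER dataset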

datasets:
- eduagarcia/LegalPT_dedup
- eduagarcia/CrawlPT_dedup
language:
- pt
pipeline_tag: fill-mask
tags:
- legal
model-index:
- name: RoBERTaLexPT-base
  results:
  - task:
      type: token-classification
    dataset:
      type: lener_br
      name: lener_br
      split: test
    metrics:
    - type: seqeval
      value: 0.9073
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Coarse
      config: UlyssesNER-Br-PL-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8856
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: UlyNER-PL Fine
      config: UlyssesNER-Br-PL-fine
      split: test
    metrics:
    - type: seqeval
      value: 0.8603
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: FGV-STF
      config: fgv-coarse
      split: test
    metrics:
    - type: seqeval
      value: 0.8040
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: RRIP
      config: rrip
      split: test
    metrics:
    - type: seqeval
      value: 0.8322
      name: F1
      args:
        scheme: IOB2
  - task:
      type: token-classification
    dataset:
      type: eduagarcia/PortuLex_benchmark
      name: PortuLex
      split: test
    metrics:
    - type: seqeval
      value: 0.8541
      name: Average F1
      args:
        scheme: IOB2
license: cc-by-4.0
metrics:
- seqeval

RoBERTaLexPT-base

RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch on the LegalPT and CrawlPT corpora, using the same architecture as RoBERTa-base, introduced by Liu et al. (2019).

Language(s) (NLP): Portuguese (pt-BR and pt-PT)
License: Creative Commons Attribution 4.0 International Public License
Repository: https://github.com/eduagarcia/roberta-legal-portuguese
Paper: https://aclanthology.org/2024.propor-1.38/

Evaluation

The model was evaluated on the "PortuLex" benchmark, a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (%) for multiple models evaluated on PortuLex benchmark test splits:

Model	LeNER	UlyNER-PL (Coarse/Fine)	FGV-STF (Coarse)	RRIP	Average (%)
BERTimbau-base	88.34	86.39/83.83	79.34	82.34	83.78
BERTimbau-large	88.64	87.77/84.74	79.71	83.79	84.60
Albertina-PT-BR-base	89.26	86.35/84.63	79.30	81.16	83.80
Albertina-PT-BR-xlarge	90.09	88.36/86.62	79.94	82.79	85.08
BERTikal-base	83.68	79.21/75.70	77.73	81.11	79.99
JurisBERT-base	81.74	81.67/77.97	76.04	80.85	79.61
BERTimbauLAW-base	84.90	87.11/84.42	79.78	82.35	83.20
Legal-XLM-R-base	87.48	83.49/83.16	79.79	82.35	83.24
Legal-XLM-R-large	88.39	84.65/84.55	79.36	81.66	83.50
Legal-RoBERTa-PT-large	87.96	88.32/84.83	79.57	81.98	84.02
Ours
RoBERTaTimbau-base (Reproduction of BERTimbau)	89.68	87.53/85.74	78.82	82.03	84.29
RoBERTaLegalPT-base (Trained on LegalPT)	90.59	85.45/84.40	79.92	82.84	84.57
RoBERTaCrawlPT-base (Trained on CrawlPT)	89.24	88.22/86.58	79.88	82.80	84.83
RoBERTaLexPT-base (this) (Trained on CrawlPT + LegalPT)	90.73	88.56/86.03	80.40	83.22	85.41
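The Average column is not defined explicitly in this card. The sketch below reproduces it under an assumption inferred from the numbers: the UlyNER-PL coarse and fine scores are averaged into a single task score, and the four task scores are then averaged. The function name `portulex_average` is hypothetical.

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Average F1 over the four PortuLex tasks (assumed formula, see lead-in)."""
    # UlyNER-PL counts as one task: mean of its coarse and fine scores.
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Scores copied from the table above.
robertalexpt = portulex_average(90.73, 88.56, 86.03, 80.40, 83.22)
bertimbau_base = portulex_average(88.34, 86.39, 83.83, 79.34, 82.34)

print(f"RoBERTaLexPT-base: {robertalexpt:.2f}")
print(f"BERTimbau-base:    {bertimbau_base:.2f}")
```

Under this assumption the computed values round to the 85.41 and 83.78 reported in the table.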

In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size. With sufficient pre-training data, it can surpass larger models. The results highlight the importance of domain-diverse training data over sheer model scale.

Training Details

RoBERTaLexPT-base is pretrained on:

LegalPT, a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
CrawlPT, a composition of three Portuguese general corpora: brWaC, the CC100 PT subset, and the OSCAR-2301 PT subset.

Training Procedure

Our pretraining process was executed using the Fairseq library v0.10.2 on a DGX-A100 cluster, using a total of two NVIDIA A100 80 GB GPUs. The complete training of a single configuration takes approximately three days.

This computational cost is similar to that of BERTimbau-base, and exposes the model to approximately 65 billion tokens during training.

Preprocessing

We deduplicated all subsets of the LegalPT and CrawlPT corpora using the MinHash algorithm and the Locality Sensitive Hashing implementation from the text-dedup library to find clusters of duplicate documents.
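To illustrate the idea, here is a minimal standard-library sketch of MinHash signatures combined with LSH banding. It is not the text-dedup implementation, and every setting in it (`NUM_PERM`, `BANDS`, the word 3-gram shingles, the lowercase and whitespace normalization) is an illustrative assumption rather than a value used for LegalPT or CrawlPT.

```python
import hashlib
from itertools import combinations

NUM_PERM = 64  # signature length (illustrative, not the authors' setting)
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS signature rows
SHINGLE = 3    # word n-gram size used as shingles (illustrative)


def shingles(text, n=SHINGLE):
    """Set of word n-grams after lowercasing and whitespace normalization."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed plays the role of one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """For each seed, keep the minimum hash over all shingles of the text."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """Documents sharing an identical band land in one bucket and are paired."""
    rows = NUM_PERM // BANDS
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "o tribunal decidiu que o recurso deve ser provido nos termos do voto do relator",
    # Same text as "a" up to casing and spacing, so it normalizes to the same shingles.
    "b": "O Tribunal decidiu que o recurso  deve ser provido nos termos do voto do relator",
    "c": "a câmara dos deputados aprovou o projeto de lei sobre proteção de dados pessoais",
}
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
print(pairs)
```

Documents that only partly overlap become candidates with a probability that grows with their Jaccard similarity; the number of bands and rows per band controls where that threshold falls. A full pipeline would then group candidate pairs into clusters and keep one document per cluster.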

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the HuggingFace Tokenizers library to train a separate vocabulary for each pre-training corpus.

Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 and a learning rate of 4e-4, each sequence containing a maximum of 512 tokens.
The weight initialization is random.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
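These settings account for the token budget of roughly 65 billion tokens quoted under Training Procedure. The short calculation below is an upper bound, assuming every sequence in every batch is filled to the 512-token maximum.

```python
steps = 62_500       # training steps
batch_size = 2048    # sequences per step
max_seq_len = 512    # maximum tokens per sequence

# Upper bound: assumes every sequence is packed to the maximum length.
tokens_seen = steps * batch_size * max_seq_len
print(f"{tokens_seen:,} tokens (about {tokens_seen / 1e9:.1f} billion)")
```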

For other parameters we adopted the standard RoBERTa-base hyperparameters:

Hyperparameter	RoBERTa-base
Number of layers	12
Hidden size	768
FFN inner hidden size	3072
Attention heads	12
Attention head size	64
Dropout	0.1
Attention dropout	0.1
Warmup steps	6k
Peak learning rate	4e-4
Batch size	2048
Weight decay	0.01
Maximum training steps	62.5k
Learning rate decay	Linear
AdamW $$\epsilon$$	1e-6
AdamW $$\beta_1$$	0.9
AdamW $$\beta_2$$	0.98
Gradient clipping	0.0
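The learning rate schedule implied by the table (linear warmup over 6k steps to a peak of 4e-4, then linear decay until step 62.5k) can be sketched as a plain function. The final learning rate is not stated in this card; the sketch assumes it decays to zero, and the names `learning_rate` and `END_LR` are hypothetical.

```python
PEAK_LR = 4e-4
WARMUP_STEPS = 6_000
TOTAL_STEPS = 62_500
END_LR = 0.0  # assumption: the card does not state the final learning rate


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to END_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step / WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    fraction = remaining / (TOTAL_STEPS - WARMUP_STEPS)
    return END_LR + (PEAK_LR - END_LR) * fraction


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, learning_rate(step))
```

The sample steps cover the start, the middle of warmup, the peak, the midpoint of the decay, and the final step.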

Citation

@inproceedings{garcia-etal-2024-robertalexpt,
    title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese",
    author = "Garcia, Eduardo A. S.  and
      Silva, Nadia F. F.  and
      Siqueira, Felipe  and
      Albuquerque, Hidelberg O.  and
      Gomes, Juliana R. S.  and
      Souza, Ellen  and
      Lima, Eliomar A.",
    editor = "Gamallo, Pablo  and
      Claro, Daniela  and
      Teixeira, Ant{\'o}nio  and
      Real, Livy  and
      Garcia, Marcos  and
      Oliveira, Hugo Gon{\c{c}}alo  and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.38",
    pages = "374--383",
}

Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).