Model Card for joelito/legal-xlm-roberta-base
This is a multilingual language model based on XLM-R and pretrained on legal data, aiming to provide strong performance in the legal domain.
🚀 Quick Start
To get started with the model, see the Hugging Face tutorials. For masked word prediction, refer to this tutorial or to the sketch below.
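As a minimal quick-start sketch, masked word prediction can be run with the standard transformers fill-mask pipeline. The example sentence is purely illustrative and not taken from the original card:
```python
from transformers import pipeline

# Fill-mask pipeline with the legal XLM-R base model.
unmasker = pipeline("fill-mask", model="joelito/legal-xlm-roberta-base")

# XLM-R-style tokenizers use <mask> as the mask token.
for prediction in unmasker("The court dismissed the <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```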
✨ Features
- Multilingual Support: Covers 24 languages: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv.
- Legal Domain Focus: Specifically trained on legal data to perform well in legal tasks.
- Transformer-based Architecture: Based on the RoBERTa architecture, a powerful Transformer-based language model.
📦 Installation
No model-specific installation steps are required; install the Hugging Face transformers library (e.g. pip install transformers) and load the model as shown in the usage examples below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel

model = AutoModel.from_pretrained('joelito/legal-xlm-roberta-base')
print(model)
```
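Extending the basic example, the following sketch shows how to obtain contextual token embeddings with the matching tokenizer; the German example sentence is purely illustrative:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('joelito/legal-xlm-roberta-base')
model = AutoModel.from_pretrained('joelito/legal-xlm-roberta-base')

inputs = tokenizer("Der Vertrag ist nichtig.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```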
Advanced Usage
The main purpose of this model is to be fine-tuned on downstream tasks such as sequence classification, token classification, or question answering; a minimal fine-tuning sketch is shown below.
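The following is a minimal fine-tuning sketch for sequence classification, assuming you supply your own tokenized dataset; the hyperparameters, output directory, and num_labels are illustrative placeholders rather than values from the original card:
```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Pretrained legal encoder with a freshly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "joelito/legal-xlm-roberta-base", num_labels=2  # num_labels is task-specific
)
tokenizer = AutoTokenizer.from_pretrained("joelito/legal-xlm-roberta-base")

training_args = TrainingArguments(
    output_dir="legal-xlm-r-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# `train_dataset` is a placeholder for your own tokenized dataset.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, tokenizer=tokenizer)
# trainer.train()
```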
📚 Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Developed by | Joel Niklaus: huggingface; email |
| Model Type | Transformer-based language model (RoBERTa) |
| Language(s) (NLP) | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv |
| License | CC BY-SA |
Uses
Direct Use and Downstream Use
You can use the raw model for masked language modeling, since next sentence prediction was not performed. Its main intended use, however, is fine-tuning on downstream tasks that rely on the entire sentence, potentially with masked elements, to make decisions, such as sequence classification, token classification, or question answering. For text generation tasks, models like GPT-2 are more suitable.
Note that this model was trained on legal data, so its performance may vary when applied to non-legal data.
Out-of-Scope Use
For tasks such as text generation, consider models like GPT-2 instead. The model should not be used to create hostile or alienating environments for people. It was not trained to be a factual or truthful representation of people or events, so using it to generate such content is out of scope.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.
Recommendations
Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model.
Training Details
This model was pretrained on Multi Legal Pile (Niklaus et al. 2023).
Training Steps
- Warm-starting: Initialize from the original XLM-R checkpoints (base and large) of Conneau et al. (2019).
- Tokenization: Train a new 128K BPE tokenizer to better cover legal language. Reuse the original XLM-R embeddings for lexically overlapping tokens and use random embeddings for the rest (see the sketch after this list).
- Pretraining: Continue pretraining on Multi Legal Pile with batches of 512 samples for an additional 1M/500K steps for the base/large model, using warm-up steps, a linearly increasing learning rate, and cosine decay scheduling.
- Sentence Sampling: Employ a sentence sampler with exponential smoothing to handle disparate token proportions across cantons and languages.
- Mixed-Case Models: The models cover both upper- and lowercase letters.
- Long-Context Training: Train the base-size multilingual model on long contexts with windowed attention for legal documents.
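The warm-starting of the new vocabulary described in the Tokenization step can be illustrated with the sketch below: the original XLM-R embeddings are copied for tokens that also exist in the new legal vocabulary, while the remaining rows keep a random initialization. This is not the authors' actual code, and the local tokenizer path is a hypothetical placeholder:
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Original XLM-R checkpoint and tokenizer (warm-start source).
src_model = AutoModel.from_pretrained("xlm-roberta-base")
src_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical path to the newly trained 128K legal BPE tokenizer.
new_tokenizer = AutoTokenizer.from_pretrained("./legal-128k-tokenizer")

src_embeddings = src_model.get_input_embeddings().weight.data
hidden_size = src_embeddings.shape[1]

# Random embeddings for the whole new vocabulary ...
new_embeddings = torch.normal(mean=0.0, std=0.02,
                              size=(len(new_tokenizer), hidden_size))

# ... overwritten with the original XLM-R embeddings for lexically overlapping tokens.
src_vocab = src_tokenizer.get_vocab()
for token, new_id in new_tokenizer.get_vocab().items():
    src_id = src_vocab.get(token)
    if src_id is not None:
        new_embeddings[new_id] = src_embeddings[src_id]

# Resize the model to the new vocabulary and install the warm-started embeddings.
src_model.resize_token_embeddings(len(new_tokenizer))
src_model.get_input_embeddings().weight.data = new_embeddings
```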
Training Data
The model was pretrained on Multi Legal Pile (Niklaus et al. 2023).
Preprocessing
For more details, see Niklaus et al. 2023.
Training Hyperparameters
| Property | Details |
|---|---|
| Batch size | 512 samples |
| Number of steps | 1M / 500K for the base / large model |
| Warm-up steps | First 5% of the total training steps |
| Learning rate | (Linearly increasing up to) 1e-4 |
| Word masking | Increased masking rate of 20% / 30% for the base / large model, respectively |
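As an illustration of the increased masking rate (20% for the base model rather than the usual 15% used in RoBERTa-style MLM), a standard masked-language-modeling data collator could be configured as follows; this is a sketch, not the authors' training code:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("joelito/legal-xlm-roberta-base")

# 20% masking for the base model (30% for the large model) instead of the default 15%.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.20,
)
```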
Evaluation
For more insight into the evaluation, refer to the trainer state. Additional information is available in the TensorBoard logs.
For performance on downstream tasks such as LEXTREME (Niklaus et al. 2023) or LexGLUE (Chalkidis et al. 2021), refer to the results presented in Niklaus et al. (2023) 1, 2.
Model Architecture and Objective
It is a RoBERTa-based model. The architecture can be viewed by running the following code:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained('joelito/legal-xlm-roberta-base')
print(model)
```
```
RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(128000, 768, padding_idx=0)
    (position_embeddings): Embedding(514, 768, padding_idx=0)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): RobertaIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): RobertaOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): RobertaPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
```
Compute Infrastructure
Hardware
Google TPU v3-8
Software
pytorch, transformers
🔧 Technical Details
Training Process
The model's training process involves multiple steps: warm-starting, tokenization, continued pretraining, sentence sampling, mixed-case handling, and long-context training. Each step is designed to adapt the model to legal data; a sketch of the exponentially smoothed sampling is shown below.
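As a rough illustration of exponentially smoothed sampling across sub-corpora (the technique used to balance disparate token proportions across languages and cantons), consider the sketch below; the token counts and the smoothing exponent are invented for the example and are not the values used in training:
```python
# Illustrative (invented) token counts per sub-corpus, e.g. per language.
token_counts = {"de": 120e9, "fr": 40e9, "it": 15e9, "en": 200e9}

# Smoothing exponent alpha < 1 up-weights low-resource sub-corpora.
alpha = 0.3

unnormalized = {lang: count ** alpha for lang, count in token_counts.items()}
total = sum(unnormalized.values())
sampling_probs = {lang: weight / total for lang, weight in unnormalized.items()}

for lang, prob in sampling_probs.items():
    print(f"{lang}: {prob:.3f}")
```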
Hyperparameter Tuning
Hyperparameters such as the batch size, number of steps, learning rate schedule, and word-masking rate were chosen to improve performance on legal tasks; a sketch of the warm-up plus cosine-decay schedule is shown below.
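A minimal sketch of the learning-rate schedule described above (linear warm-up over the first 5% of steps to a peak of 1e-4, followed by cosine decay), using the scheduler built into transformers; the choice of AdamW as the optimizer is an assumption, not confirmed by the original card:
```python
import torch
from transformers import AutoModelForMaskedLM, get_cosine_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("joelito/legal-xlm-roberta-base")

# Peak learning rate of 1e-4; the optimizer choice (AdamW) is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

total_steps = 1_000_000                  # 1M steps for the base model
warmup_steps = int(0.05 * total_steps)   # warm-up over the first 5% of steps

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
```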
📄 License
The model is released under the CC BY-SA license.
Citation
```bibtex
@article{Niklaus2023MultiLegalPileA6,
  title   = {MultiLegalPile: A 689GB Multilingual Legal Corpus},
  author  = {Joel Niklaus and Veton Matoshi and Matthias St{\"u}rmer and Ilias Chalkidis and Daniel E. Ho},
  journal = {ArXiv},
  year    = {2023},
  volume  = {abs/2306.02069}
}
```
Model Card Authors
- Joel Niklaus: huggingface; email
- Veton Matoshi: huggingface; email
Model Card Contact
- Joel Niklaus: huggingface; email
- Veton Matoshi: huggingface; email

