Model Card for joelito/legal-xlm-roberta-base
This is a multilingual language model based on XLM-R and pretrained on legal data, aiming to provide strong performance in the legal domain.
🚀 Quick Start
To get started with the model, see the Hugging Face tutorials. For masked word prediction, refer to this tutorial or to the sketch below.
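As a minimal quick-start sketch, masked word prediction can be run with the standard transformers fill-mask pipeline. The example sentence is purely illustrative and not taken from the original card:
```python
from transformers import pipeline

# Fill-mask pipeline with the legal XLM-R base model.
unmasker = pipeline("fill-mask", model="joelito/legal-xlm-roberta-base")

# XLM-R-style tokenizers use <mask> as the mask token.
for prediction in unmasker("The court dismissed the <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```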
✨ Features
- Multilingual Support: Covers 24 languages: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv.
- Legal Domain Focus: Specifically trained on legal data to perform well in legal tasks.
- Transformer-based Architecture: Based on the RoBERTa architecture, a powerful Transformer-based language model.
📦 Installation
No model-specific installation steps are required; install the Hugging Face transformers library (e.g. pip install transformers) and load the model as shown in the usage examples below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModel

model = AutoModel.from_pretrained('joelito/legal-xlm-roberta-base')
print(model)
```
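Extending the basic example, the following sketch shows how to obtain contextual token embeddings with the matching tokenizer; the German example sentence is purely illustrative:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('joelito/legal-xlm-roberta-base')
model = AutoModel.from_pretrained('joelito/legal-xlm-roberta-base')

inputs = tokenizer("Der Vertrag ist nichtig.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```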
Advanced Usage
The main purpose of this model is to be fine-tuned on downstream tasks such as sequence classification, token classification, or question answering; a minimal fine-tuning sketch is shown below.
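The following is a minimal fine-tuning sketch for sequence classification, assuming you supply your own tokenized dataset; the hyperparameters, output directory, and num_labels are illustrative placeholders rather than values from the original card:
```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Pretrained legal encoder with a freshly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "joelito/legal-xlm-roberta-base", num_labels=2  # num_labels is task-specific
)
tokenizer = AutoTokenizer.from_pretrained("joelito/legal-xlm-roberta-base")

training_args = TrainingArguments(
    output_dir="legal-xlm-r-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# `train_dataset` is a placeholder for your own tokenized dataset.
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, tokenizer=tokenizer)
# trainer.train()
```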
📚 Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Developed by | Joel Niklaus: huggingface; email |
| Model Type | Transformer-based language model (RoBERTa) |
| Language(s) (NLP) | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv |
| License | CC BY-SA |
Uses
Direct Use and Downstream Use
You can use the raw model for masked language modeling, since next sentence prediction was not performed. Its main intended use, however, is fine-tuning on downstream tasks that rely on the entire sentence, potentially with masked elements, to make decisions, such as sequence classification, token classification, or question answering. For text generation tasks, models like GPT-2 are more suitable.
Note that this model was trained on legal data, so its performance may vary when applied to non-legal data.
Out-of-Scope Use
For tasks such as text generation, consider models like GPT-2 instead. The model should not be used to create hostile or alienating environments for people. It was not trained to be a factual or truthful representation of people or events, so using it to generate such content is out of scope.
Bias, Risks, and Limitations
Significant research has explored bias and fairness issues with language models (e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.
Recommendations
Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model.
Training Details
This model was pretrained on Multi Legal Pile (Niklaus et al. 2023).
Training Steps
- Warm-starting: Initialize from the original XLM-R checkpoints (base and large) of Conneau et al. (2019).
- Tokenization: Train a new 128K BPE tokenizer to better cover legal language. Reuse the original XLM-R embeddings for lexically overlapping tokens and use random embeddings for the rest (see the sketch after this list).
- Pretraining: Continue pretraining on Multi Legal Pile with batches of 512 samples for an additional 1M/500K steps for the base/large model, using warm-up steps, a linearly increasing learning rate, and cosine decay scheduling.
- Sentence Sampling: Employ a sentence sampler with exponential smoothing to handle disparate token proportions across cantons and languages.
- Mixed-Case Models: The models cover both upper- and lowercase letters.
- Long-Context Training: Train the base-size multilingual model on long contexts with windowed attention for legal documents.
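The warm-starting of the new vocabulary described in the Tokenization step can be illustrated with the sketch below: the original XLM-R embeddings are copied for tokens that also exist in the new legal vocabulary, while the remaining rows keep a random initialization. This is not the authors' actual code, and the local tokenizer path is a hypothetical placeholder:
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Original XLM-R checkpoint and tokenizer (warm-start source).
src_model = AutoModel.from_pretrained("xlm-roberta-base")
src_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical path to the newly trained 128K legal BPE tokenizer.
new_tokenizer = AutoTokenizer.from_pretrained("./legal-128k-tokenizer")

src_embeddings = src_model.get_input_embeddings().weight.data
hidden_size = src_embeddings.shape[1]

# Random embeddings for the whole new vocabulary ...
new_embeddings = torch.normal(mean=0.0, std=0.02,
                              size=(len(new_tokenizer), hidden_size))

# ... overwritten with the original XLM-R embeddings for lexically overlapping tokens.
src_vocab = src_tokenizer.get_vocab()
for token, new_id in new_tokenizer.get_vocab().items():
    src_id = src_vocab.get(token)
    if src_id is not None:
        new_embeddings[new_id] = src_embeddings[src_id]

# Resize the model to the new vocabulary and install the warm-started embeddings.
src_model.resize_token_embeddings(len(new_tokenizer))
src_model.get_input_embeddings().weight.data = new_embeddings
```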
Training Data
The model was pretrained on Multi Legal Pile (Niklaus et al. 2023).
Preprocessing
For more details, see Niklaus et al. 2023.
Training Hyperparameters
| Property | Details |
|---|---|
| Batch size | 512 samples |
| Number of steps | 1M / 500K for the base / large model |
| Warm-up steps | First 5% of the total training steps |
| Learning rate | (Linearly increasing up to) 1e-4 |
| Word masking | Increased masking rate of 20% / 30% for the base / large model, respectively |
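As an illustration of the increased masking rate (20% for the base model rather than the usual 15% used in RoBERTa-style MLM), a standard masked-language-modeling data collator could be configured as follows; this is a sketch, not the authors' training code:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("joelito/legal-xlm-roberta-base")

# 20% masking for the base model (30% for the large model) instead of the default 15%.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.20,
)
```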
Evaluation
For more insight into the evaluation, refer to the trainer state. Additional information is available in the TensorBoard logs.
For performance on downstream tasks such as LEXTREME (Niklaus et al. 2023) or LexGLUE (Chalkidis et al. 2021), refer to the results presented in Niklaus et al. (2023) 1, 2.
Model Architecture and Objective
It is a RoBERTa-based model. The architecture can be viewed by running the following code:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained('joelito/legal-xlm-roberta-base')
print(model)
```
```
RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(128000, 768, padding_idx=0)
    (position_embeddings): Embedding(514, 768, padding_idx=0)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): RobertaIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): RobertaOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): RobertaPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
```
Compute Infrastructure
Hardware
Google TPU v3-8
Software
pytorch, transformers
🔧 Technical Details
Training Process
The model's training process involves multiple steps: warm-starting, tokenization, continued pretraining, sentence sampling, mixed-case handling, and long-context training. Each step is designed to adapt the model to legal data; a sketch of the exponentially smoothed sampling is shown below.
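As a rough illustration of exponentially smoothed sampling across sub-corpora (the technique used to balance disparate token proportions across languages and cantons), consider the sketch below; the token counts and the smoothing exponent are invented for the example and are not the values used in training:
```python
# Illustrative (invented) token counts per sub-corpus, e.g. per language.
token_counts = {"de": 120e9, "fr": 40e9, "it": 15e9, "en": 200e9}

# Smoothing exponent alpha < 1 up-weights low-resource sub-corpora.
alpha = 0.3

unnormalized = {lang: count ** alpha for lang, count in token_counts.items()}
total = sum(unnormalized.values())
sampling_probs = {lang: weight / total for lang, weight in unnormalized.items()}

for lang, prob in sampling_probs.items():
    print(f"{lang}: {prob:.3f}")
```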
Hyperparameter Tuning
Hyperparameters such as the batch size, number of steps, learning rate schedule, and word-masking rate were chosen to improve performance on legal tasks; a sketch of the warm-up plus cosine-decay schedule is shown below.
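A minimal sketch of the learning-rate schedule described above (linear warm-up over the first 5% of steps to a peak of 1e-4, followed by cosine decay), using the scheduler built into transformers; the choice of AdamW as the optimizer is an assumption, not confirmed by the original card:
```python
import torch
from transformers import AutoModelForMaskedLM, get_cosine_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("joelito/legal-xlm-roberta-base")

# Peak learning rate of 1e-4; the optimizer choice (AdamW) is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

total_steps = 1_000_000                  # 1M steps for the base model
warmup_steps = int(0.05 * total_steps)   # warm-up over the first 5% of steps

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
```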
📄 License
The model is released under the CC BY-SA license.
Citation
```bibtex
@article{Niklaus2023MultiLegalPileA6,
  title   = {MultiLegalPile: A 689GB Multilingual Legal Corpus},
  author  = {Joel Niklaus and Veton Matoshi and Matthias St{\"u}rmer and Ilias Chalkidis and Daniel E. Ho},
  journal = {ArXiv},
  year    = {2023},
  volume  = {abs/2306.02069}
}
```
Model Card Authors
- Joel Niklaus: huggingface; email
- Veton Matoshi: huggingface; email
Model Card Contact
- Joel Niklaus: huggingface; email
- Veton Matoshi: huggingface; email

