RoBERTalex Open Source Model - Free Processing of Spanish Legal Texts, Focused on Legal Domain Applications

Developed by PlanTL-GOB-ES

A RoBERTa base model trained on Spanish legal domain corpus, specializing in Spanish legal text processing

Large Language Model

Transformers

Spanish | Open Source License: Apache-2.0 | #Spanish legal domain #Masked language modeling #RoBERTa architecture

Downloads 379

Release Time : 3/2/2022

Model Overview

This model is a Spanish masked language model based on the Transformer architecture and pre-trained on legal domain texts. It can be used directly for masked language modeling or as a pre-trained base for fine-tuning on downstream tasks.

Model Features

Legal domain specialization

Pre-trained on an 8.9GB Spanish legal domain corpus; the reported evaluation covers general-domain tasks only

High-quality preprocessing

Training data underwent rigorous preprocessing including sentence segmentation, language detection, abnormal sentence filtering, and content deduplication

Multi-task adaptability

Can be directly used for masked language modeling tasks or fine-tuned as a base model for downstream tasks

Model Capabilities

Legal text understanding

Masked language modeling

Text feature extraction

Legal text classification

Legal named entity recognition

Use Cases

Legal text processing

Legal text completion

Automatically completing missing content in legal documents

The fill-mask example in Quick Start predicts legal terms such as "modificada", "derogada", and "aprobada"

Legal Q&A systems

Serving as a base model for legal question answering systems

Legal document classification

Automatic classification of legal documents

🚀 RoBERTa base trained with Spanish Legal Domain Corpora

RoBERTalex is a Spanish masked language model based on RoBERTa, pre-trained on a large Spanish legal corpus, suitable for fill-mask tasks and fine-tuning on downstream tasks.

🚀 Quick Start

Basic Usage

>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='PlanTL-GOB-ES/RoBERTalex')
>>> pprint(unmasker("La ley fue <mask> finalmente."))
[{'score': 0.21217258274555206,
  'sequence': ' La ley fue modificada finalmente.',
  'token': 5781,
  'token_str': ' modificada'},
 {'score': 0.20414969325065613,
  'sequence': ' La ley fue derogada finalmente.',
  'token': 15951,
  'token_str': ' derogada'},
 {'score': 0.19272951781749725,
  'sequence': ' La ley fue aprobada finalmente.',
  'token': 5534,
  'token_str': ' aprobada'},
 {'score': 0.061143241822719574,
  'sequence': ' La ley fue revisada finalmente.',
  'token': 14192,
  'token_str': ' revisada'},
 {'score': 0.041809432208538055,
  'sequence': ' La ley fue aplicada finalmente.',
  'token': 12208,
  'token_str': ' aplicada'}]
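
The pipeline returns a list of dictionaries shaped like the output above. As a minimal sketch, assuming that shape, the predictions can be filtered by score before further use. The helper name `top_tokens` and the `min_score` threshold are hypothetical and not part of the model card or of the `transformers` API.

```python
def top_tokens(predictions, min_score=0.1):
    """Return (token, score) pairs whose score is at least min_score."""
    return [
        (p["token_str"].strip(), round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]


# Scores and tokens copied from the fill-mask output shown above
# (only the two fields this helper reads are kept).
sample = [
    {"score": 0.21217258274555206, "token_str": " modificada"},
    {"score": 0.20414969325065613, "token_str": " derogada"},
    {"score": 0.19272951781749725, "token_str": " aprobada"},
    {"score": 0.061143241822719574, "token_str": " revisada"},
    {"score": 0.041809432208538055, "token_str": " aplicada"},
]

print(top_tokens(sample))
# [('modificada', 0.2122), ('derogada', 0.2041), ('aprobada', 0.1927)]
```

Stripping `token_str` removes the leading space that the byte-level tokenizer attaches to word-initial tokens.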

Advanced Usage

>>> from transformers import RobertaTokenizer, RobertaModel
>>> tokenizer = RobertaTokenizer.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
>>> model = RobertaModel.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
>>> text = "Gracias a los datos legales se ha podido desarrollar este modelo del lenguaje."
>>> encoded_input = tokenizer(text, return_tensors='pt')
>>> output = model(**encoded_input)
>>> print(output.last_hidden_state.shape)
torch.Size([1, 16, 768])

✨ Features

Architecture: Based on roberta-base.
Language: Supports Spanish.
Task: Specialized in fill-mask tasks and can be fine-tuned for downstream tasks.
Data: Trained on legal data.

📚 Documentation

Overview

Property	Details
Model Type	roberta-base
Language	Spanish
Task	fill-mask
Data	Legal

Model description

RoBERTalex is a transformer-based masked language model for the Spanish language. It is based on the RoBERTa base model and has been pre-trained on the Spanish Legal Domain Corpora, a collection totaling 8.9GB of text.

Intended uses and limitations

Out of the box, RoBERTalex supports only masked language modeling, that is, the fill-mask task (see Quick Start). It is intended to be fine-tuned on non-generative downstream tasks such as question answering, text classification, or named entity recognition.

Limitations and bias

At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

Training

Training data

The Spanish Legal Domain Corpora comprise multiple digital resources totaling 8.9GB of text. Part of the data was obtained from [previous work](https://aclanthology.org/2020.lt4gov-1.6/). To obtain a high-quality training corpus, the data was preprocessed with a pipeline that includes, among other operations, sentence splitting, language detection, filtering of ill-formed sentences, and deduplication of repeated content. Document boundaries are preserved throughout.
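
The model card does not publish the preprocessing code, so the following is only a rough sketch of what such a pipeline could look like. The regex splitter, the `is_well_formed` heuristic, and its thresholds are all assumptions made for illustration. Language detection is omitted because it would need an external library.

```python
import re

# Naive splitter: break after ., ! or ? followed by whitespace.
# This is an assumption; the source does not name its sentence splitter.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]


def is_well_formed(sentence, min_words=3, min_alpha_ratio=0.6):
    """Hypothetical filter: enough words and mostly letters or spaces."""
    if len(sentence.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return alpha / len(sentence) >= min_alpha_ratio


def preprocess(documents):
    """Filter and deduplicate sentences while keeping document boundaries."""
    seen = set()  # lowercased sentences already emitted, across all documents
    cleaned = []
    for doc in documents:
        kept = []
        for sent in split_sentences(doc):
            key = sent.lower()
            if is_well_formed(sent) and key not in seen:
                seen.add(key)
                kept.append(sent)
        if kept:
            cleaned.append(kept)  # one list of sentences per document
    return cleaned


docs = [
    "La ley fue aprobada por el Congreso. 12345 ---- 6789. "
    "La ley fue aprobada por el Congreso.",
    "El tribunal dictó sentencia ayer. La ley fue aprobada por el Congreso.",
]
for doc in preprocess(docs):
    print(doc)
```

Each document stays a separate list of sentences, which mirrors the statement that document boundaries are kept during preprocessing.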

Training procedure

The training corpus was tokenized with the byte-level version of Byte-Pair Encoding (BPE) used in the original RoBERTa model, with a vocabulary size of 50,262 tokens.
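
To illustrate the idea behind byte-level BPE, here is a toy sketch that learns merges directly over UTF-8 bytes. It is not the tokenizer shipped with the model: the function name `learn_bpe_merges` is hypothetical, and real implementations also pre-tokenize the text and learn tens of thousands of merges.

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text (toy version)."""
    seq = [bytes([b]) for b in text.encode("utf-8")]  # start from single bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not help
        merges.append((left, right))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)  # replace the pair with one symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq


merges, tokens = learn_bpe_merges("ley ley ley", num_merges=10)
print(len(merges), tokens)
```

Because the base alphabet is bytes rather than characters, any input, including accented Spanish letters, can be encoded without an unknown token.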

RoBERTalex pre-training consists of masked language model training, following the approach used for RoBERTa base. The model was trained until convergence on 2 computing nodes, each with 4 NVIDIA V100 GPUs with 16GB of VRAM.
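
As a sketch of the masked language modeling objective, the function below applies the masking scheme commonly described for BERT and RoBERTa: about 15% of tokens are selected, and of those 80% become the mask token, 10% a random token, and 10% stay unchanged. These percentages are assumed here, since the model card does not state them. The `mask_id` value is a placeholder, and special tokens are not excluded, to keep the example short.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for one masked-LM training example."""
    if rng is None:
        rng = random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100: position is ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_prob:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)
ids = list(range(100, 120))  # stand-in token ids, not real vocabulary entries
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=50262, rng=rng)
print(masked)
print(labels)
```

Calling `mask_tokens` again on the same sequence draws a new mask, which is the dynamic masking behaviour associated with RoBERTa.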

Evaluation

Due to the lack of domain-specific evaluation data, the model was evaluated on general-domain tasks, where it obtains reasonable performance. We fine-tuned the model on the following tasks:

Dataset	Metric	RoBERTalex
UD-POS	F1	0.9871
CoNLL-NERC	F1	0.8323
CAPITEL-POS	F1	0.9788
CAPITEL-NERC	F1	0.8394
STS	Combined	0.7374
MLDoc	Accuracy	0.9417
PAWS-X	F1	0.7304
XNLI	Accuracy	0.7337

Additional information

Author

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

Contact information

For further information, send an email to <plantl-gob-es@bsc.es>

Copyright

Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Citing information

@misc{gutierrezfandino2021legal,
      title={Spanish Legalese Language Model and Corpora}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2110.12201},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.

In no event shall the owner of the models (SEDIA – State Secretariat for digitalization and artificial intelligence) nor the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
