🚀 legal_t5_small_trans_es_en_small_finetuned Model
A model designed for translating legal text from Spanish to English, offering high-quality translation in the legal domain.
🚀 Quick Start
The legal_t5_small_trans_es_en_small_finetuned model is dedicated to translating legal text from Spanish to English. It was first introduced in this repository. The model is pre-trained on all translation data through an unsupervised task and then fine-tuned on three parallel corpora from JRC-ACQUIS, Europarl, and DCEP.
✨ Features
- Unsupervised Pretraining: Initially pre-trained on an unsupervised "masked language modelling" task using all the training set data.
- Based on t5-small: Built upon the t5-small model, it uses d_model = 512, d_ff = 2,048, 8-headed attention, and 6 layers each in the encoder and decoder, scaling down the baseline t5 model (the configuration values can be checked as shown in the sketch after this list).
- Smaller Parameter Count: With about 60 million parameters, it is a relatively small model.
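As a quick sanity check, the architecture hyperparameters listed above can be read from the model configuration. This is a minimal sketch; the attribute names follow the standard `T5Config` fields in `transformers`, and the expected values in the comments are the t5-small defaults cited above:

```python
from transformers import AutoConfig

# Load the configuration of the fine-tuned checkpoint.
config = AutoConfig.from_pretrained("SEBIS/legal_t5_small_trans_es_en_small_finetuned")

print(config.d_model)             # expected: 512
print(config.d_ff)                # expected: 2048
print(config.num_heads)           # expected: 8
print(config.num_layers)          # encoder layers, expected: 6
print(config.num_decoder_layers)  # decoder layers, expected: 6
```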
💻 Usage Examples
Basic Usage
Here is how to use this model to translate legal text from Spanish to English in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline

# Build a translation pipeline from the fine-tuned checkpoint and the base tokenizer.
pipeline = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_es_en_small_finetuned"),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="SEBIS/legal_t5_small_trans_es_en",
        do_lower_case=False,
        skip_special_tokens=True,
    ),
    device=0,  # GPU index; use -1 to run on CPU
)

# Spanish legal text to translate into English.
es_text = "de Jonas Sjöstedt (GUE/NGL)"

pipeline([es_text], max_length=512)
```
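Note that `AutoModelWithLMHead` is deprecated in recent releases of `transformers`. A minimal equivalent sketch, assuming a recent `transformers` version, uses `AutoModelForSeq2SeqLM` instead:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TranslationPipeline

pipeline = TranslationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained("SEBIS/legal_t5_small_trans_es_en_small_finetuned"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/legal_t5_small_trans_es_en", do_lower_case=False),
    device=0,  # set to -1 to run on CPU
)

print(pipeline(["de Jonas Sjöstedt (GUE/NGL)"], max_length=512))
```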
📚 Documentation
Model description
The legal_t5_small_trans_es_en_small_finetuned model is initially pre-trained on an unsupervised "masked language modelling" task with all the data in the training set. It is based on the t5-small model and trained on a large parallel text corpus. This smaller model scales down the t5 baseline model, having about 60 million parameters.
Intended uses & limitations
The model is suitable for translating legal texts from Spanish to English.
Training data
The legal_t5_small_trans_es_en_small_finetuned model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), EUROPARL, and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together contain 9 million parallel texts.
Training procedure
- Overall Training: The model was trained on a single TPU Pod V3-8 for 250K steps in total, using a sequence length of 512 (batch size 4096). It has approximately 220M parameters and uses an encoder-decoder architecture.
- Preprocessing: A unigram model was trained with 88M lines of text from the parallel corpus (of all possible language pairs) to obtain the vocabulary (with byte pair encoding) for this model; see the sketch after this list.
- Pretraining: The pre-training data was the combined data from all 42 language pairs, and the task was to predict randomly masked portions of a sentence.
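For illustration only, a unigram vocabulary of this kind can be built with the `sentencepiece` library. The input file name and vocabulary size below are placeholder assumptions, not the values used for this model:

```python
import sentencepiece as spm

# Train a unigram model on a (hypothetical) concatenated parallel corpus.
# "parallel_corpus.txt" and vocab_size=32000 are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",
    model_prefix="legal_t5_vocab",
    model_type="unigram",
    vocab_size=32000,
)

# Load the resulting model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="legal_t5_vocab.model")
print(sp.encode("de Jonas Sjöstedt (GUE/NGL)", out_type=str))
```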
Evaluation results
When used on the translation test dataset, the model achieves the following results:
| Model | BLEU score |
|-------|------------|
| legal_t5_small_trans_es_en_small_finetuned | 54.481 |
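As an illustration of how such a corpus-level BLEU score can be computed on a held-out test set, here is a minimal sketch using `sacrebleu`; the test file paths ("test.es", "test.en") are hypothetical placeholders:

```python
import sacrebleu
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline

# Hypothetical test files: one Spanish source and one English reference per line.
with open("test.es") as f:
    sources = [line.strip() for line in f]
with open("test.en") as f:
    references = [line.strip() for line in f]

pipeline = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_es_en_small_finetuned"),
    tokenizer=AutoTokenizer.from_pretrained("SEBIS/legal_t5_small_trans_es_en", do_lower_case=False),
    device=0,
)

hypotheses = [out["translation_text"] for out in pipeline(sources, max_length=512)]

# Corpus-level BLEU against a single reference set.
print(sacrebleu.corpus_bleu(hypotheses, [references]).score)
```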
BibTeX entry and citation info
Created by Ahmed Elnaggar/@Elnaggar_AI | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)
| Property | Details |
|----------|---------|
| Model Type | A model for translating legal text from Spanish to English, based on t5-small and fine-tuned on specific corpora |
| Training Data | Trained on [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), EUROPARL, and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets with 9 million parallel texts |