🚀 legal_t5_small_trans_cs_en_small_finetuned model
A model for translating legal text from Czech to English, initially released in this repository. It first undergoes unsupervised pre-training on all translation data and is then fine-tuned on three parallel corpora.
🚀 Quick Start
This model is designed for translating legal text from Czech to English. It was first released in this repository. The model is pre-trained on all translation data through an unsupervised task and then trained on three parallel corpora from jrc-acquis, europarl, and dcep.
✨ Features
- Initially pre-trained on an unsupervised "masked language modelling" task with all training data.
- Based on the `t5-small` model, trained on a large parallel text corpus.
- A smaller model with `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and 6 layers each in the encoder and decoder, giving about 60 million parameters.
📦 Installation
No specific installation steps are provided in the original document.
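The usage example below assumes the Hugging Face `transformers` stack is available, e.g. `pip install transformers sentencepiece torch` (a typical setup, not a requirement stated in the original document).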
💻 Usage Examples
Basic Usage
Here is how to use this model to translate legal text from Czech to English in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline

pipeline = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_cs_en_small_finetuned"),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="SEBIS/legal_t5_small_trans_cs_en",
        do_lower_case=False,
        skip_special_tokens=True,
    ),
    device=0,  # first GPU; set to -1 to run on CPU
)

# Sample Czech legal sentence from the original model card.
cs_text = "4) Seznam užívaných výrobků s obsahem PFOS: Kvůli značnému poklesu výroby PFOS po roce 2000 představují největší zdroj emisí patrně dřívější využití, která však nadále reálně existují."

pipeline([cs_text], max_length=512)
```
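The pipeline returns one dictionary per input sentence; a minimal sketch of reading the output, assuming the standard `TranslationPipeline` return format:

```python
# Each result is a dict with a "translation_text" key.
results = pipeline([cs_text], max_length=512)
print(results[0]["translation_text"])
```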
📚 Documentation
Model description
The legal_t5_small_trans_cs_en_small_finetuned model is initially pre-trained on an unsupervised task with all the training data. The unsupervised task is "masked language modelling". It is based on the `t5-small` model and trained on a large parallel text corpus. This is a smaller model that scales down the baseline `t5` model, using `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
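As a quick sanity check on the size claim, the parameter count can be inspected after loading the checkpoint; a minimal sketch, not part of the original card:

```python
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_cs_en_small_finetuned")

# Sum the element counts of all weight tensors; for this t5-small
# variant the total should be on the order of 60 million.
print(sum(p.numel() for p in model.parameters()))
```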
Intended uses & limitations
The model can be used for translating legal texts from Czech to English.
Training data
The legal_t5_small_trans_cs_en_small_finetuned model (both for the supervised task involving only the corresponding language pair and the unsupervised task with all language-pair data) was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), EUROPARL, and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together consist of 5 million parallel texts.
Training procedure
- Preprocessing: A unigram model was trained on 88M lines of text from the parallel corpus (across all possible language pairs) to obtain the vocabulary (with byte-pair encoding) used with this model; a sketch follows this list.
- Pretraining: The pre-training data was the combined data from all 42 language pairs. The model's task was to predict the randomly masked portions of a sentence (illustrated after this list).
- The model was trained on a single TPU Pod V3-8 for a total of 250K steps, using a sequence length of 512 (batch size 4,096). It has approximately 220M parameters in total and was trained with an encoder-decoder architecture. The optimizer used is AdaFactor, with an inverse square-root learning rate schedule for pre-training.
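The vocabulary step can be illustrated with the `sentencepiece` library; a hedged sketch in which the input file name and vocabulary size are assumptions, not values from the card:

```python
import sentencepiece as spm

# Train a unigram subword model on the combined parallel corpus.
# "parallel_corpus.txt" and vocab_size=32000 are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",
    model_prefix="legal_t5_vocab",
    model_type="unigram",
    vocab_size=32000,
)
```

The masked language modelling objective can likewise be sketched. T5-style pre-training replaces parts of the input with sentinel tokens and asks the decoder to reconstruct them; the simplified single-token variant below is illustrative only, not the exact masking scheme used for this model:

```python
import random

def mask_tokens(tokens, mask_prob=0.3, seed=1):
    """Replace random tokens with sentinel tokens and build the
    target sequence that restores them (simplified T5-style masking)."""
    rng = random.Random(seed)
    inputs, targets, sentinel = [], [], 0
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.extend([f"<extra_id_{sentinel}>", tok])
            sentinel += 1
        else:
            inputs.append(tok)
    return " ".join(inputs), " ".join(targets)

src, tgt = mask_tokens("the commission shall adopt implementing acts".split())
print(src)  # corrupted input with sentinel placeholders
print(tgt)  # target that reconstructs the masked tokens
```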
Evaluation results
When used on the translation test dataset, the model achieves the following results:
| Model | BLEU score |
|:---|:---:|
| legal_t5_small_trans_cs_en_small_finetuned | 56.936 |
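For reference, BLEU scores of this kind are typically computed with `sacrebleu`; a minimal sketch, assuming lists of model outputs and reference translations that are not part of this card:

```python
import sacrebleu

# Placeholders: one hypothesis per test sentence, and one reference
# stream containing the corresponding gold English translations.
hypotheses = ["the list of products in use containing PFOS"]
references = [["the list of products in use containing PFOS"]]

# corpus_bleu takes the hypotheses and a list of reference streams.
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```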
BibTeX entry and citation info
Created by Ahmed Elnaggar/@Elnaggar_AI | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)