legal_t5_small_trans_cs_de_small_finetuned Open-source Model - Free Translation from Czech Legal Texts to German

Legal T5 Small Trans Cs De Small Finetuned

Developed by SEBIS

This model translates legal texts from Czech to German. It is based on the T5-small architecture and fine-tuned on legal parallel corpora.

Tags: Machine Translation, Legal text translation, Czech to German, EU legal specialization

Release Time: 3/2/2022

Model Overview

A sequence-to-sequence model specialized for Czech legal text to German translation, trained on legal parallel corpora such as JRC-ACQUIS, EUROPARL, and DCEP.

Model Features

Legal domain specialization

Optimized training for the terminology and sentence structures of legal texts.

Multi-dataset joint training

Integration of three major legal parallel corpora: JRC-ACQUIS, EUROPARL, and DCEP.

Two-phase training strategy

The model first learns general features through unsupervised pre-training and is then fine-tuned with supervision.

Model Capabilities

Legal text translation

Cross-language semantic conversion

Specialized terminology processing

Use Cases

Legal document translation

EU legal document translation

Translating Czech versions of EU legal provisions into German versions

BLEU score 44.175 (test set)

Cross-border legal compliance

Assisting businesses in quickly understanding Czech legal requirements through German translations

🚀 legal_t5_small_trans_cs_de_small_finetuned Model

A model designed for translating legal text from Czech to German, offering efficient and accurate legal language translation.

🚀 Quick Start

The legal_t5_small_trans_cs_de_small_finetuned model translates legal text from Czech to German. It was first released in this repository. The model is pre-trained on all translation data through an unsupervised task and then fine-tuned on three parallel corpora: JRC-ACQUIS, Europarl, and DCEP.

✨ Features

Specialized Translation: Specifically designed for legal text translation from Czech to German.
Based on T5: Built upon the t5-small model, a scaled-down T5 variant with about 60 million parameters.
Large-scale Training: Trained on about 5 million parallel texts drawn from JRC-ACQUIS, Europarl, and DCEP.

💻 Usage Examples

Basic Usage

Here is how to use this model to translate legal text from Czech to German in PyTorch:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline

# Load the model and tokenizer, then wrap both in a translation pipeline.
pipeline = TranslationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained("SEBIS/legal_t5_small_trans_cs_de_small_finetuned"),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="SEBIS/legal_t5_small_trans_cs_de",
        do_lower_case=False,
        skip_special_tokens=True,
    ),
    device=0,  # first GPU; use device=-1 to run on the CPU
)

cs_text = "Vzhledem k tomu, že tento právní předpis bude přímo použitelný v členských státech a zavede mnoho povinností pro ty, na něž se vztahuje, je žádoucí, aby se jim poskytlo více času na přizpůsobení se těmto novým pravidlům."

# Returns a list with one dict per input, each holding a "translation_text" entry.
pipeline([cs_text], max_length=512)
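For more than a handful of sentences, it may help to feed the pipeline in fixed-size chunks. The sketch below is one possible way to do that, not part of the original model card: `load_translator` and `translate_in_batches` are hypothetical helper names, and the chunk size of 8 is an arbitrary assumption. The helper only relies on the pipeline returning one dict with a `translation_text` key per input.

```python
def load_translator(device=-1):
    """Build the translation pipeline (device=-1 runs on the CPU, 0 on the first GPU)."""
    # Imported lazily so the batching helper below can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline

    model = AutoModelForSeq2SeqLM.from_pretrained(
        "SEBIS/legal_t5_small_trans_cs_de_small_finetuned"
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "SEBIS/legal_t5_small_trans_cs_de", do_lower_case=False, skip_special_tokens=True
    )
    return TranslationPipeline(model=model, tokenizer=tokenizer, device=device)


def translate_in_batches(translator, texts, batch_size=8, max_length=512):
    """Translate `texts` chunk by chunk and return the German strings in input order.

    `translator` is any callable with the TranslationPipeline calling convention:
    it takes a list of strings and returns one {"translation_text": ...} dict per string.
    """
    translations = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        outputs = translator(chunk, max_length=max_length)
        translations.extend(item["translation_text"] for item in outputs)
    return translations
```

A call such as `translate_in_batches(load_translator(), czech_sentences)` would then return one German string per Czech input; the first call downloads the model weights.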

📚 Documentation

Model description

The legal_t5_small_trans_cs_de_small_finetuned model is first pre-trained on an unsupervised task, masked language modelling, using all the data in the training set. It is based on the t5-small model and trained on a large parallel text corpus. This smaller model scales down the baseline T5 model by using d_model = 512, d_ff = 2,048, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
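The parameter figure can be checked with a rough count from the dimensions above. This is a back-of-the-envelope sketch, not taken from the model card: it assumes the public t5-small configuration for everything the text does not state (a per-head dimension of 64, a vocabulary of 32,128 entries, 32 relative-position buckets, bias-free linear layers, and one embedding matrix shared with the output layer). The legal vocabulary used by this model may have a different size, which would shift the total slightly.

```python
def t5_param_count(d_model=512, d_ff=2048, num_heads=8, d_kv=64,
                   num_layers=6, vocab_size=32_128, rel_buckets=32):
    """Approximate parameter count of a T5 encoder-decoder (assumed t5-small defaults)."""
    inner = num_heads * d_kv
    attention = 4 * d_model * inner        # query, key, value, and output projections
    feed_forward = 2 * d_model * d_ff      # two linear layers without biases
    encoder_layer = attention + feed_forward + 2 * d_model      # plus 2 layer norms
    decoder_layer = 2 * attention + feed_forward + 3 * d_model  # self + cross attention, 3 norms
    embeddings = vocab_size * d_model      # assumed shared by encoder, decoder, and output layer
    rel_bias = 2 * rel_buckets * num_heads # one bias table each for encoder and decoder
    final_norms = 2 * d_model              # final layer norm of each stack
    return num_layers * (encoder_layer + decoder_layer) + embeddings + rel_bias + final_norms


print(f"{t5_param_count():,}")  # 60,506,624
```

Under these assumptions the count lands at roughly 60.5 million, in line with the "about 60 million" stated above.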

Intended uses & limitations

The model can be used for translating legal texts from Czech to German.

Training data

The legal_t5_small_trans_cs_de_small_finetuned model was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), EUROPARL, and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together consist of 5 million parallel texts. The supervised task used only the data for the corresponding language pair, while the unsupervised task used the data of all language pairs.

Training procedure

The model was trained on a single TPU Pod V3-8 for a total of 250K steps, using a sequence length of 512 and a batch size of 4096. It uses the encoder-decoder architecture and, as a t5-small variant, has about 60 million parameters in total. The optimizer used for pre-training is AdaFactor with an inverse square root learning rate schedule.
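An inverse square root schedule holds the learning rate constant for an initial period and then decays it in proportion to 1/sqrt(step). A minimal sketch follows, assuming the form 1/sqrt(max(step, warmup_steps)); the warm-up length of 10,000 steps is an assumption for illustration, since the model card does not state the constants actually used.

```python
import math


def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Learning rate 1 / sqrt(max(step, warmup_steps)); warmup_steps is an assumed value."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


# Constant during the assumed warm-up, then decaying over the 250K training steps.
for step in (1, 10_000, 40_000, 250_000):
    print(f"step {step:>7}: lr = {inverse_sqrt_lr(step):.4f}")
```

With these assumed constants the rate stays at 0.01 for the first 10,000 steps and falls to 0.002 by step 250,000.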

Preprocessing

A unigram model was trained on 88M lines of text from the parallel corpus (covering all possible language pairs) to obtain the vocabulary (with byte pair encoding) that is used with this model.
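To illustrate how a unigram vocabulary is applied at tokenization time, the sketch below picks the segmentation of a word that maximizes the summed log-probability of its pieces, using dynamic programming. It is a toy illustration only: the function name, the pieces, and their log-probabilities are invented for the example and are not taken from this model's vocabulary.

```python
import math


def unigram_segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces found in `log_probs`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {text!r} with the given vocabulary")
    pieces = []
    pos = n
    while pos > 0:  # walk the back-pointers from the end of the string
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


# Hypothetical pieces and log-probabilities, for illustration only.
toy_vocab = {"smlouva": -4.0, "smlouv": -3.0, "a": -2.0, "y": -2.5}
print(unigram_segment("smlouva", toy_vocab))  # ['smlouva']
print(unigram_segment("smlouvy", toy_vocab))  # ['smlouv', 'y']
```

The whole word wins when it is in the vocabulary with a better score than its parts (-4.0 versus -5.0 here); otherwise the word falls back to smaller pieces.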

Pretraining

The pre-training data was the combined data from all 42 language pairs. The task for the model was to predict the randomly masked portions of a sentence.
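The masking objective can be pictured as follows. This sketch assumes T5-style span corruption, where each masked span is replaced by a sentinel token such as `<extra_id_0>` and the target lists the sentinels followed by the hidden tokens. The model card does not spell out the exact masking scheme, so the sentinel names, span length, and sampling rule here are assumptions, and `corrupt_spans` and `sample_spans` are hypothetical helper names.

```python
import random


def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input tokens, target tokens)."""
    inputs, targets = [], []
    position = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[position:start])  # keep the visible tokens before the span
        inputs.append(sentinel)                # hide the span behind one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])      # the model has to predict these tokens
        position = end
    inputs.extend(tokens[position:])
    targets.append(f"<extra_id_{len(spans)}>") # closing sentinel ends the target
    return inputs, targets


def sample_spans(num_tokens, num_spans=2, span_length=2, seed=0):
    """Pick non-overlapping spans at random (an assumed, simplified sampling rule)."""
    # Candidate starts are spaced so that spans can neither overlap nor touch.
    candidates = list(range(0, num_tokens - span_length + 1, span_length + 1))
    starts = random.Random(seed).sample(candidates, num_spans)
    return [(start, start + span_length) for start in sorted(starts)]


tokens = "je žádoucí aby se jim poskytlo více času".split()
inputs, targets = corrupt_spans(tokens, [(1, 2), (5, 7)])
print(" ".join(inputs))   # je <extra_id_0> aby se jim <extra_id_1> času
print(" ".join(targets))  # <extra_id_0> žádoucí <extra_id_1> poskytlo více <extra_id_2>
```

In pre-training the spans would be drawn at random, for example with `sample_spans(len(tokens))`, rather than fixed by hand as in the printed example.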

Evaluation results

When the model is used on the translation test dataset, it achieves the following results:

Model	BLEU score
legal_t5_small_trans_cs_de_small_finetuned	44.175
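For readers unfamiliar with the metric, BLEU combines modified n-gram precisions (usually up to 4-grams) with a brevity penalty for translations that are shorter than the reference. The sketch below is a simplified corpus-level version with whitespace tokenization, a single reference per sentence, and no smoothing. The model card does not say which BLEU implementation or tokenization produced the score above, so this illustrates the formula and is not a way to reproduce that number.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): one reference per sentence, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tok, n)
            ref_counts = ngram_counts(ref_tok, n)
            # Clip each n-gram count by how often it appears in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = "Die Mitgliedstaaten erlassen die erforderlichen Vorschriften"
print(corpus_bleu([reference], [reference]))  # identical output scores 100
```

A hypothesis that matches only part of the reference scores somewhere between 0 and 100, and dropping words from an otherwise correct translation lowers the score through the brevity penalty.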

BibTeX entry and citation info

Created by Ahmed Elnaggar/@Elnaggar_AI | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)

📦 Information Table

Property	Details
Model Type	legal_t5_small_trans_cs_de_small_finetuned
Training Data	[JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), EUROPARL, and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets consisting of 5 million parallel texts
