🚀 legal_t5_small_trans_cs_en_small_finetuned model
A model for translating legal text from Czech to English, initially released in this repository. It first undergoes unsupervised pre-training on all translation data and is then fine-tuned on three parallel corpora.
🚀 Quick Start
This model is designed for translating legal text from Czech to English. It was first released in this repository. The model is pre-trained on all translation data through an unsupervised task and then trained on three parallel corpora from jrc-acquis, europarl, and dcep.
✨ Features
- Initially pre-trained on an unsupervised "masked language modelling" task with all training data.
- Based on the `t5-small` model, trained on a large parallel text corpus.
- A smaller model with `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and 6 layers each in the encoder and decoder, giving about 60 million parameters.
📦 Installation
No specific installation steps are provided in the original document.
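The usage example below assumes the Hugging Face `transformers` stack is available, e.g. `pip install transformers sentencepiece torch` (a typical setup, not a requirement stated in the original document).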
💻 Usage Examples
Basic Usage
Here is how to use this model to translate legal text from Czech to English in PyTorch:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline

pipeline = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_cs_en_small_finetuned"),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path="SEBIS/legal_t5_small_trans_cs_en",
        do_lower_case=False,
        skip_special_tokens=True,
    ),
    device=0,  # first GPU; set to -1 to run on CPU
)

# Sample Czech legal sentence from the original model card.
cs_text = "4) Seznam užívaných výrobků s obsahem PFOS: Kvůli značnému poklesu výroby PFOS po roce 2000 představují největší zdroj emisí patrně dřívější využití, která však nadále reálně existují."

pipeline([cs_text], max_length=512)
```
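The pipeline returns one dictionary per input sentence; a minimal sketch of reading the output, assuming the standard `TranslationPipeline` return format:

```python
# Each result is a dict with a "translation_text" key.
results = pipeline([cs_text], max_length=512)
print(results[0]["translation_text"])
```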
📚 Documentation
Model description
The legal_t5_small_trans_cs_en_small_finetuned model is initially pre-trained on an unsupervised task with all the training data. The unsupervised task is "masked language modelling". It is based on the `t5-small` model and trained on a large parallel text corpus. This is a smaller model that scales down the baseline `t5` model, using `d_model = 512`, `d_ff = 2,048`, 8-headed attention, and 6 layers each in the encoder and decoder. This variant has about 60 million parameters.
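As a quick sanity check on the size claim, the parameter count can be inspected after loading the checkpoint; a minimal sketch, not part of the original card:

```python
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("SEBIS/legal_t5_small_trans_cs_en_small_finetuned")

# Sum the element counts of all weight tensors; for this t5-small
# variant the total should be on the order of 60 million.
print(sum(p.numel() for p in model.parameters()))
```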
Intended uses & limitations
The model can be used for translating legal texts from Czech to English.
Training data
The legal_t5_small_trans_cs_en_small_finetuned model (both for the supervised task involving only the corresponding language pair and the unsupervised task with all language-pair data) was trained on the [JRC-ACQUIS](https://wt-public.emm4u.eu/Acquis/index_2.2.html), EUROPARL, and [DCEP](https://ec.europa.eu/jrc/en/language-technologies/dcep) datasets, which together consist of 5 million parallel texts.
Training procedure
- Preprocessing: A unigram model was trained on 88M lines of text from the parallel corpus (across all possible language pairs) to obtain the vocabulary (with byte-pair encoding) used with this model; a sketch follows this list.
- Pretraining: The pre-training data was the combined data from all 42 language pairs. The model's task was to predict the randomly masked portions of a sentence (illustrated after this list).
- The model was trained on a single TPU Pod V3-8 for a total of 250K steps, using a sequence length of 512 (batch size 4,096). It has approximately 220M parameters in total and was trained with an encoder-decoder architecture. The optimizer used is AdaFactor, with an inverse square-root learning rate schedule for pre-training.
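The vocabulary step can be illustrated with the `sentencepiece` library; a hedged sketch in which the input file name and vocabulary size are assumptions, not values from the card:

```python
import sentencepiece as spm

# Train a unigram subword model on the combined parallel corpus.
# "parallel_corpus.txt" and vocab_size=32000 are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="parallel_corpus.txt",
    model_prefix="legal_t5_vocab",
    model_type="unigram",
    vocab_size=32000,
)
```

The masked language modelling objective can likewise be sketched. T5-style pre-training replaces parts of the input with sentinel tokens and asks the decoder to reconstruct them; the simplified single-token variant below is illustrative only, not the exact masking scheme used for this model:

```python
import random

def mask_tokens(tokens, mask_prob=0.3, seed=1):
    """Replace random tokens with sentinel tokens and build the
    target sequence that restores them (simplified T5-style masking)."""
    rng = random.Random(seed)
    inputs, targets, sentinel = [], [], 0
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.extend([f"<extra_id_{sentinel}>", tok])
            sentinel += 1
        else:
            inputs.append(tok)
    return " ".join(inputs), " ".join(targets)

src, tgt = mask_tokens("the commission shall adopt implementing acts".split())
print(src)  # corrupted input with sentinel placeholders
print(tgt)  # target that reconstructs the masked tokens
```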
Evaluation results
When used on the translation test dataset, the model achieves the following results:
| Model | BLEU score |
|:---|:---:|
| legal_t5_small_trans_cs_en_small_finetuned | 56.936 |
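For reference, BLEU scores of this kind are typically computed with `sacrebleu`; a minimal sketch, assuming lists of model outputs and reference translations that are not part of this card:

```python
import sacrebleu

# Placeholders: one hypothesis per test sentence, and one reference
# stream containing the corresponding gold English translations.
hypotheses = ["the list of products in use containing PFOS"]
references = [["the list of products in use containing PFOS"]]

# corpus_bleu takes the hypotheses and a list of reference streams.
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```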
BibTeX entry and citation info
Created by Ahmed Elnaggar/@Elnaggar_AI | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)