🚀 T5-base-nl36 for Finnish
A pre-trained T5 model for the Finnish language, using span-based masked language modeling (MLM) as the objective.
T5 was first introduced in this paper and initially released on this page.
⚠️ Important Note
The Hugging Face inference widget is deactivated because this model must be fine-tuned on a specific downstream text-to-text task before it is useful in practice. For an example of a fine-tuned Finnish T5 model, see Finnish-NLP/t5-small-nl24-casing-punctuation-correction, which has been fine-tuned to correct missing casing and punctuation in Finnish text.
✨ Features
Model Description
T5 is an encoder-decoder model that approaches all NLP problems in a text-to-text format.
Finnish T5 is a transformer model pre-trained on a vast corpus of Finnish data in a self-supervised manner. This means it was pre-trained solely on raw texts, without any human labeling (allowing it to utilize a large amount of publicly available data), using an automated process to generate inputs and outputs from those texts.
Specifically, it was pre-trained with the span-based masked language modeling (MLM) objective. Spans of the input sequence are masked by so-called sentinel tokens (also known as unique mask tokens), and the output sequence is formed by concatenating the same sentinel tokens and the actual masked tokens. Through this process, the model learns an internal representation of the Finnish language.
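As a rough, self-contained illustration of this input/target format (not the actual T5 preprocessing code; the sentence and span positions are made up for the example):

```python
# A rough illustration of the span-corruption format: masked spans are replaced
# by sentinel tokens in the input, and the target lists each sentinel followed
# by the tokens it hides.
sentence = "Hyvää huomenta kaikille ystäville".split()
masked_spans = [(1, 2), (3, 4)]  # (start, end) word indices, chosen here for illustration

inputs, targets, kept = [], [], 0
for i, (start, end) in enumerate(masked_spans):
    inputs += sentence[kept:start] + [f"<extra_id_{i}>"]
    targets += [f"<extra_id_{i}>"] + sentence[start:end]
    kept = end
inputs += sentence[kept:]
targets += [f"<extra_id_{len(masked_spans)}>"]  # closing sentinel

print(" ".join(inputs))   # Hyvää <extra_id_0> kaikille <extra_id_1>
print(" ".join(targets))  # <extra_id_0> huomenta <extra_id_1> ystäville <extra_id_2>
```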
During pre-training, this model incorporated the improvements from T5 v1.1 compared to the original T5 model:
- GEGLU activation in the feed-forward hidden layer, instead of ReLU - see here
- Dropout was disabled during pre-training (resulting in improved quality); dropout should be re-enabled during fine-tuning (see the sketch after this list)
- Pre-trained only on the span-based masked language modeling (MLM) objective, without incorporating downstream tasks
- No parameter sharing between the embedding and classifier layers
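A minimal sketch of re-enabling dropout when loading the checkpoint for fine-tuning; the value 0.1 is the usual T5 default and an assumption here, not a recommendation from the authors:

```python
from transformers import T5ForConditionalGeneration

# Dropout was 0.0 during pre-training; override it in the config for fine-tuning.
model = T5ForConditionalGeneration.from_pretrained(
    "Finnish-NLP/t5-base-nl36-finnish",
    dropout_rate=0.1,  # assumed value; the standard T5 default
)
```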
This model also utilized the "efficient" T5 architecture findings presented in this paper. In brief, the paper suggests that a Deep-Narrow model architecture is more favorable for downstream performance compared to other model architectures with a similar number of parameters. More precisely, model depth is defined as the number of transformer blocks stacked sequentially.
This model employs the layer depth of the t5-efficient-base-nl36 architecture, meaning both the encoder and the decoder have 36 transformer layers, compared to the original T5 "base" model architecture, which has 12 transformer layers.
In total, this model has 814 million parameters.
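As a quick sanity check of these figures, you can inspect the model configuration (a small sketch using the Hugging Face config, not part of the original card):

```python
from transformers import T5Config

config = T5Config.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
print(config.num_layers, config.num_decoder_layers)  # expected: 36 encoder and 36 decoder layers
```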
Intended Uses & Limitations
This model was only pre-trained in a self-supervised manner, without any supervised training. Therefore, it must be fine-tuned before it can be used on a downstream task, such as text classification, unlike Google's original T5 model.
⚠️ Important Note
You will most likely need to fine-tune these T5 models without mixed precision, using full fp32 precision. You can also find more fine-tuning tips here.
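For instance, a hedged sketch of loading the weights explicitly in full fp32 before fine-tuning (the torch_dtype argument is standard transformers usage, not something specific to this model):

```python
import torch
from transformers import T5ForConditionalGeneration

# Load in full fp32 and avoid fp16/mixed-precision training flags when fine-tuning.
model = T5ForConditionalGeneration.from_pretrained(
    "Finnish-NLP/t5-base-nl36-finnish",
    torch_dtype=torch.float32,
)
```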
How to Use
Here is how to use this model in PyTorch:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
model = T5ForConditionalGeneration.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
```
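Since the model was pre-trained only on span corruption, you can probe the raw checkpoint by asking it to fill in a sentinel token. This is a minimal sketch continuing the PyTorch snippet above; the Finnish prompt is just an illustration:

```python
# Ask the pre-trained model to fill in the masked span <extra_id_0>.
inputs = tokenizer("Terveisiä <extra_id_0> !", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```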
And in TensorFlow:
```python
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
model = TFT5ForConditionalGeneration.from_pretrained("Finnish-NLP/t5-base-nl36-finnish", from_pt=True)
```
Limitations and Bias
The training data used for this model contains a large amount of unfiltered content from the internet, which is far from neutral. Therefore, the model may produce biased predictions. This bias will also affect all fine-tuned versions of this model.
Training Data
This Finnish T5 model was pre-trained on a combination of six datasets:
- mc4_fi_cleaned: The mC4 dataset is a multilingual, colossal, and cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset and further cleaned it using our own text data cleaning codes (check the dataset repo).
- wikipedia: We used the Finnish subset of the Wikipedia (August 2021) dataset.
- Yle Finnish News Archive 2011-2018
- Yle Finnish News Archive 2019-2020
- Finnish News Agency Archive (STT)
- The Suomi24 Sentences Corpus
The raw datasets were automatically cleaned to filter out low-quality and non-Finnish examples. In addition, a perplexity score was calculated for all texts with a KenLM model trained only on very clean Finnish texts, so the perplexity score indicates how clean the Finnish in a given text is. Finally, all datasets were concatenated, and the 90th percentile of the perplexity scores was used as a filtering threshold to remove the lowest-quality 10% of the texts. Together, these cleaned datasets amounted to approximately 76GB of text.
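A minimal sketch of this kind of perplexity-based filtering with the kenlm Python bindings; the model file name, example texts, and exact percentile handling are assumptions for illustration, not the authors' cleaning script:

```python
import kenlm
import numpy as np

lm = kenlm.Model("clean_finnish.arpa")  # hypothetical KenLM model trained on very clean Finnish text
texts = ["Tämä on esimerkkilause.", "asdf qwer zxcv mnbv"]

perplexities = np.array([lm.perplexity(t) for t in texts])
threshold = np.percentile(perplexities, 90)  # keep roughly the best 90% of texts
kept = [t for t, p in zip(texts, perplexities) if p <= threshold]
```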
Training Procedure
Preprocessing
The texts are tokenized with a SentencePiece model using a vocabulary size of 32000. The inputs and outputs are sequences of 512 consecutive tokens. The texts are not lowercased, so this model is case-sensitive: it distinguishes between "finnish" and "Finnish".
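For instance, the tokenizer loaded in the usage section above produces different token ids for cased and uncased variants (a tiny illustrative check):

```python
# Case-sensitive tokenization: the two variants map to different token ids.
print(tokenizer("Finnish").input_ids)
print(tokenizer("finnish").input_ids)
```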
Pretraining
The model was trained on a TPUv3-8 VM, sponsored by the Google TPU Research Cloud, for 1M steps with a batch size of 64 (a total of 33B tokens). The optimizer used was AdaFactor with a learning rate warm-up for 10K steps at a constant learning rate of 1e-2, followed by an inverse square root decay of the learning rate.
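The schedule can be sketched as follows (an illustration of the described schedule only, not the actual t5x training configuration):

```python
# Constant warm-up at 1e-2 for 10K steps, then inverse square-root decay.
def learning_rate(step, warmup_steps=10_000, base_lr=1e-2):
    if step < warmup_steps:
        return base_lr
    return base_lr * (warmup_steps / step) ** 0.5

print(learning_rate(5_000))   # 0.01 during warm-up
print(learning_rate(40_000))  # 0.005 once the inverse square-root decay kicks in
```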
The training code was from Google's Jax/Flax-based t5x framework, and some t5x task definitions were adapted from Per's t5x work.
Evaluation Results
Evaluation was conducted by fine-tuning the model on a downstream text classification task using two different labeled Finnish datasets: Yle News and Eduskunta. Classification fine-tuning was performed with a sequence length of 128 tokens.
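A hedged sketch of how such classification fine-tuning can be framed as text-to-text, assuming the PyTorch model and tokenizer loaded in the usage section; the dataset fields, example text, and label strings are illustrative assumptions, not the exact evaluation setup:

```python
# Frame classification as text-to-text: input is the document, target is the label string.
example = {"text": "Eduskunta äänesti tänään uudesta laista.", "label": "politiikka"}

enc = tokenizer(example["text"], max_length=128, truncation=True, return_tensors="pt")
labels = tokenizer(example["label"], return_tensors="pt").input_ids

loss = model(**enc, labels=labels).loss  # standard seq2seq cross-entropy, minimized during fine-tuning
```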
When fine-tuned on these datasets, this model (the sixth row of the table) achieves the following accuracy results compared to our other T5 models and their parameter counts:
Model | Model parameters | Yle News accuracy (%) | Eduskunta accuracy (%) |
---|---|---|---|
Finnish-NLP/t5-tiny-nl6-finnish | 31 million | 92.80 | 69.07 |
Finnish-NLP/t5-mini-nl8-finnish | 72 million | 93.89 | 71.43 |
Finnish-NLP/t5-small-nl16-finnish | 184 million | 94.46 | 74.00 |
Finnish-NLP/t5-small-nl24-finnish | 260 million | 94.68 | 74.90 |
Finnish-NLP/byt5-base-finnish | 582 million | 92.33 | 73.13 |
Finnish-NLP/t5-base-nl36-finnish | 814 million | 94.40 | 75.97 |
Finnish-NLP/t5-large-nl36-finnish | 1425 million | 94.17 | 73.50 |
When fine-tuning Google's multilingual mT5 models on the same datasets, we can clearly see that our monolingual Finnish T5 models achieve significantly better results in Finnish text classification:
Model | Model parameters | Yle News accuracy (%) | Eduskunta accuracy (%) |
---|---|---|---|
google/mt5-small | 301 million | 91.51 | 64.10 |
google/mt5-base | 583 million | 92.71 | 68.40 |
📄 License
This project is licensed under the Apache-2.0 license.
Acknowledgements
This project would not have been possible without the generous computing resources provided by Google through the TPU Research Cloud.
Team Members
- Aapo Tanskanen, Hugging Face profile, LinkedIn profile
- Rasmus Toivanen, Hugging Face profile, LinkedIn profile
Feel free to contact us for more details 🤗

