🚀 T5-base-nl36 for Finnish
A pre-trained T5 model for the Finnish language, using span-based masked language modeling (MLM) as the objective.
T5 was first introduced in this paper and initially released on this page.
⚠️ Important Note
The Hugging Face inference widget is deactivated because this model must be fine-tuned on a specific downstream text-to-text task before it is useful in practice. For an example of a fine-tuned Finnish T5 model, see Finnish-NLP/t5-small-nl24-casing-punctuation-correction, which has been fine-tuned to correct missing casing and punctuation in Finnish text.
✨ Features
Model Description
T5 is an encoder-decoder model that approaches all NLP problems in a text-to-text format.
Finnish T5 is a transformer model pre-trained on a vast corpus of Finnish data in a self-supervised manner. This means it was pre-trained solely on raw texts, without any human labeling (allowing it to utilize a large amount of publicly available data), using an automated process to generate inputs and outputs from those texts.
Specifically, it was pre-trained with the span-based masked language modeling (MLM) objective. Spans of the input sequence are masked by so-called sentinel tokens (also known as unique mask tokens), and the output sequence is formed by concatenating the same sentinel tokens and the actual masked tokens. Through this process, the model learns an internal representation of the Finnish language.
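As a rough, self-contained illustration of this input/target format (not the actual T5 preprocessing code; the sentence and span positions are made up for the example):

```python
# A rough illustration of the span-corruption format: masked spans are replaced
# by sentinel tokens in the input, and the target lists each sentinel followed
# by the tokens it hides.
sentence = "Hyvää huomenta kaikille ystäville".split()
masked_spans = [(1, 2), (3, 4)]  # (start, end) word indices, chosen here for illustration

inputs, targets, kept = [], [], 0
for i, (start, end) in enumerate(masked_spans):
    inputs += sentence[kept:start] + [f"<extra_id_{i}>"]
    targets += [f"<extra_id_{i}>"] + sentence[start:end]
    kept = end
inputs += sentence[kept:]
targets += [f"<extra_id_{len(masked_spans)}>"]  # closing sentinel

print(" ".join(inputs))   # Hyvää <extra_id_0> kaikille <extra_id_1>
print(" ".join(targets))  # <extra_id_0> huomenta <extra_id_1> ystäville <extra_id_2>
```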
During pre-training, this model incorporated the improvements from T5 v1.1 compared to the original T5 model:
- GEGLU activation in the feed-forward hidden layer, instead of ReLU - see here
- Dropout was disabled during pre-training (resulting in improved quality); dropout should be re-enabled during fine-tuning (see the sketch after this list)
- Pre-trained only on the span-based masked language modeling (MLM) objective, without incorporating downstream tasks
- No parameter sharing between the embedding and classifier layers
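A minimal sketch of re-enabling dropout when loading the checkpoint for fine-tuning; the value 0.1 is the usual T5 default and an assumption here, not a recommendation from the authors:

```python
from transformers import T5ForConditionalGeneration

# Dropout was 0.0 during pre-training; override it in the config for fine-tuning.
model = T5ForConditionalGeneration.from_pretrained(
    "Finnish-NLP/t5-base-nl36-finnish",
    dropout_rate=0.1,  # assumed value; the standard T5 default
)
```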
This model also utilized the "efficient" T5 architecture findings presented in this paper. In brief, the paper suggests that a Deep-Narrow model architecture is more favorable for downstream performance compared to other model architectures with a similar number of parameters. More precisely, model depth is defined as the number of transformer blocks stacked sequentially.
This model employs the layer depth of the t5-efficient-base-nl36 architecture, meaning both the encoder and the decoder have 36 transformer layers, compared to the original T5 "base" model architecture, which has 12 transformer layers.
In total, this model has 814 million parameters.
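As a quick sanity check of these figures, you can inspect the model configuration (a small sketch using the Hugging Face config, not part of the original card):

```python
from transformers import T5Config

config = T5Config.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
print(config.num_layers, config.num_decoder_layers)  # expected: 36 encoder and 36 decoder layers
```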
Intended Uses & Limitations
This model was only pre-trained in a self-supervised manner, without any supervised training. Therefore, it must be fine-tuned before it can be used on a downstream task, such as text classification, unlike Google's original T5 model.
⚠️ Important Note
You will most likely need to fine-tune these T5 models without mixed precision, using full fp32 precision. You can also find more fine-tuning tips here.
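For instance, a hedged sketch of loading the weights explicitly in full fp32 before fine-tuning (the torch_dtype argument is standard transformers usage, not something specific to this model):

```python
import torch
from transformers import T5ForConditionalGeneration

# Load in full fp32 and avoid fp16/mixed-precision training flags when fine-tuning.
model = T5ForConditionalGeneration.from_pretrained(
    "Finnish-NLP/t5-base-nl36-finnish",
    torch_dtype=torch.float32,
)
```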
How to Use
Here is how to use this model in PyTorch:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
model = T5ForConditionalGeneration.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
```
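Since the model was pre-trained only on span corruption, you can probe the raw checkpoint by asking it to fill in a sentinel token. This is a minimal sketch continuing the PyTorch snippet above; the Finnish prompt is just an illustration:

```python
# Ask the pre-trained model to fill in the masked span <extra_id_0>.
inputs = tokenizer("Terveisiä <extra_id_0> !", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```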
And in TensorFlow:
```python
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Finnish-NLP/t5-base-nl36-finnish")
model = TFT5ForConditionalGeneration.from_pretrained("Finnish-NLP/t5-base-nl36-finnish", from_pt=True)
```
Limitations and Bias
The training data used for this model contains a large amount of unfiltered content from the internet, which is far from neutral. Therefore, the model may produce biased predictions. This bias will also affect all fine-tuned versions of this model.
Training Data
This Finnish T5 model was pre-trained on a combination of six datasets:
- mc4_fi_cleaned: The mC4 dataset is a multilingual, colossal, and cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset and further cleaned it using our own text data cleaning codes (check the dataset repo).
- wikipedia: We used the Finnish subset of the Wikipedia (August 2021) dataset.
- Yle Finnish News Archive 2011-2018
- Yle Finnish News Archive 2019-2020
- Finnish News Agency Archive (STT)
- The Suomi24 Sentences Corpus
The raw datasets were automatically cleaned to filter out low-quality and non-Finnish examples. In addition, a perplexity score was calculated for all texts with a KenLM model trained only on very clean Finnish texts, so the perplexity score indicates how clean the Finnish in a given text is. Finally, all datasets were concatenated, and the 90th percentile of the perplexity scores was used as a filtering threshold to remove the lowest-quality 10% of the texts. Together, these cleaned datasets amounted to approximately 76GB of text.
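A minimal sketch of this kind of perplexity-based filtering with the kenlm Python bindings; the model file name, example texts, and exact percentile handling are assumptions for illustration, not the authors' cleaning script:

```python
import kenlm
import numpy as np

lm = kenlm.Model("clean_finnish.arpa")  # hypothetical KenLM model trained on very clean Finnish text
texts = ["Tämä on esimerkkilause.", "asdf qwer zxcv mnbv"]

perplexities = np.array([lm.perplexity(t) for t in texts])
threshold = np.percentile(perplexities, 90)  # keep roughly the best 90% of texts
kept = [t for t, p in zip(texts, perplexities) if p <= threshold]
```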
Training Procedure
Preprocessing
The texts are tokenized with a SentencePiece model using a vocabulary size of 32000. The inputs and outputs are sequences of 512 consecutive tokens. The texts are not lowercased, so this model is case-sensitive: it distinguishes between "finnish" and "Finnish".
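For instance, the tokenizer loaded in the usage section above produces different token ids for cased and uncased variants (a tiny illustrative check):

```python
# Case-sensitive tokenization: the two variants map to different token ids.
print(tokenizer("Finnish").input_ids)
print(tokenizer("finnish").input_ids)
```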
Pretraining
The model was trained on a TPUv3-8 VM, sponsored by the Google TPU Research Cloud, for 1M steps with a batch size of 64 (a total of 33B tokens). The optimizer used was AdaFactor with a learning rate warm-up for 10K steps at a constant learning rate of 1e-2, followed by an inverse square root decay of the learning rate.
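The schedule can be sketched as follows (an illustration of the described schedule only, not the actual t5x training configuration):

```python
# Constant warm-up at 1e-2 for 10K steps, then inverse square-root decay.
def learning_rate(step, warmup_steps=10_000, base_lr=1e-2):
    if step < warmup_steps:
        return base_lr
    return base_lr * (warmup_steps / step) ** 0.5

print(learning_rate(5_000))   # 0.01 during warm-up
print(learning_rate(40_000))  # 0.005 once the inverse square-root decay kicks in
```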
The training code was from Google's Jax/Flax-based t5x framework, and some t5x task definitions were adapted from Per's t5x work.
Evaluation Results
Evaluation was conducted by fine-tuning the model on a downstream text classification task using two different labeled Finnish datasets: Yle News and Eduskunta. Classification fine-tuning was performed with a sequence length of 128 tokens.
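A hedged sketch of how such classification fine-tuning can be framed as text-to-text, assuming the PyTorch model and tokenizer loaded in the usage section; the dataset fields, example text, and label strings are illustrative assumptions, not the exact evaluation setup:

```python
# Frame classification as text-to-text: input is the document, target is the label string.
example = {"text": "Eduskunta äänesti tänään uudesta laista.", "label": "politiikka"}

enc = tokenizer(example["text"], max_length=128, truncation=True, return_tensors="pt")
labels = tokenizer(example["label"], return_tensors="pt").input_ids

loss = model(**enc, labels=labels).loss  # standard seq2seq cross-entropy, minimized during fine-tuning
```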
When fine-tuned on these datasets, this model (the sixth row of the table) achieves the following accuracy results compared to our other T5 models and their parameter counts:
Model | Model parameters | Yle News accuracy (%) | Eduskunta accuracy (%) |
---|---|---|---|
Finnish-NLP/t5-tiny-nl6-finnish | 31 million | 92.80 | 69.07 |
Finnish-NLP/t5-mini-nl8-finnish | 72 million | 93.89 | 71.43 |
Finnish-NLP/t5-small-nl16-finnish | 184 million | 94.46 | 74.00 |
Finnish-NLP/t5-small-nl24-finnish | 260 million | 94.68 | 74.90 |
Finnish-NLP/byt5-base-finnish | 582 million | 92.33 | 73.13 |
Finnish-NLP/t5-base-nl36-finnish | 814 million | 94.40 | 75.97 |
Finnish-NLP/t5-large-nl36-finnish | 1425 million | 94.17 | 73.50 |
When fine-tuning Google's multilingual mT5 models on the same datasets, we can clearly see that our monolingual Finnish T5 models achieve significantly better results in Finnish text classification:
Model | Model parameters | Yle News accuracy (%) | Eduskunta accuracy (%) |
---|---|---|---|
google/mt5-small | 301 million | 91.51 | 64.10 |
google/mt5-base | 583 million | 92.71 | 68.40 |
📄 License
This project is licensed under the Apache-2.0 license.
Acknowledgements
This project would not have been possible without the generous computing resources provided by Google through the TPU Research Cloud.
Team Members
- Aapo Tanskanen, Hugging Face profile, LinkedIn profile
- Rasmus Toivanen, Hugging Face profile, LinkedIn profile
Feel free to contact us for more details 🤗

