🚀 Finnish Generative Pretrained Transformer
This is a Generative Pretrained Transformer with 881M parameters for the Finnish language. The TurkuNLP Finnish GPT-3 models are a family of pretrained monolingual GPT-style language models based on the BLOOM architecture. Note that these are pure language models: they are not instruction finetuned for dialogue or question answering. They are intended as foundation models that can, for example, be instruction finetuned to serve as modern chat models.
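Because the models are plain language models, they continue text rather than follow instructions, so a task is usually posed as a prompt to be completed. The following is a minimal sketch of that idea, not an official usage recipe. It assumes the Hugging Face `transformers` library with PyTorch installed and assumes the checkpoint is published under the Hub ID `TurkuNLP/gpt3-finnish-large`; neither is stated in this document. The `Kysymys:`/`Vastaus:` prompt layout, the helper names, and the sampling settings are illustrative choices only.

```python
def build_few_shot_prompt(examples, query):
    """Format (question, answer) pairs plus an open question as text to continue."""
    blocks = [f"Kysymys: {q}\nVastaus: {a}\n" for q, a in examples]
    # The final block is left open so the model continues after "Vastaus:".
    blocks.append(f"Kysymys: {query}\nVastaus:")
    return "\n".join(blocks)


def generate(prompt, model_id="TurkuNLP/gpt3-finnish-large", max_new_tokens=40):
    """Sample a continuation of `prompt`. The model ID is an assumption."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # illustrative sampling settings, not tuned values
        top_p=0.9,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model weights on first use):
# prompt = build_few_shot_prompt(
#     [("Mikä on Suomen pääkaupunki?", "Helsinki")],
#     "Mikä on Ruotsin pääkaupunki?",
# )
# print(generate(prompt))
```

The few-shot layout stands in for instruction following: the model sees a pattern and is likely to continue it, but nothing guarantees it stops after one answer, so the output may need truncating.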
✨ Features
- Based on the BLOOM architecture, offering strong language generation capabilities for Finnish.
- Trained on a large corpus of 300B tokens to ensure high-quality language understanding.
📚 Documentation
Model Parameters
| Model | Layers | Dim | Heads | Params |
|-------|--------|-----|-------|--------|
| Small | 12 | 768 | 12 | 186M |
| Medium | 24 | 1024 | 16 | 437M |
| Large | 24 | 1536 | 16 | 881M |
| XL | 24 | 2064 | 24 | 1.5B |
| "3B" | 32 | 2560 | 32 | 2.8B |
| "8B" | 32 | 4096 | 32 | 7.5B |
| "13B" | 40 | 5120 | 40 | 13.3B |
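The Layers and Dim columns can be related to the Params column with the usual rule of thumb for decoder-only transformers: roughly 12·d² weights per layer plus the token embedding matrix. The sketch below applies that rule to the rows of the table. It is an approximation, not the authors' accounting: it ignores biases and layer norms, assumes a 4x feed-forward expansion with tied input and output embeddings, and assumes a vocabulary of 131,072 tokens, which is not stated in this document. Expect the estimates to deviate from the reported figures.

```python
# Rows from the Model Parameters table: name -> (layers, hidden dim, reported params)
MODELS = {
    "Small": (12, 768, 186e6),
    "Medium": (24, 1024, 437e6),
    "Large": (24, 1536, 881e6),
    "XL": (24, 2064, 1.5e9),
    "3B": (32, 2560, 2.8e9),
    "8B": (32, 4096, 7.5e9),
    "13B": (40, 5120, 13.3e9),
}

ASSUMED_VOCAB_SIZE = 131_072  # assumption: not given in this document


def estimate_params(layers, dim, vocab_size=ASSUMED_VOCAB_SIZE):
    """Rough size estimate: 12*d^2 weights per layer plus token embeddings."""
    per_layer = 12 * dim * dim  # about 4*d^2 for attention + 8*d^2 for the feed-forward block
    embeddings = vocab_size * dim  # counted once, assuming tied input/output embeddings
    return layers * per_layer + embeddings


for name, (layers, dim, reported) in MODELS.items():
    estimate = estimate_params(layers, dim)
    print(f"{name:>6}: estimated {estimate / 1e6:8.0f}M, reported {reported / 1e6:8.0f}M")
```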
Datasets
We used a combination of multiple Finnish resources:
- Finnish Internet Parsebank (https://turkunlp.org/finnish_nlp.html)
- mC4: multilingual colossal, cleaned Common Crawl (https://huggingface.co/datasets/mc4)
- Common Crawl Finnish (https://TODO)
- Finnish Wikipedia (https://fi.wikipedia.org/wiki)
- Lönnrot: Projekti Lönnrot (http://www.lonnrot.net/)
- ePub: National library "epub" collection
- Lehdet: National library "lehdet" collection
- Suomi24: The Suomi 24 Corpus 2001-2020 (http://urn.fi/urn:nbn:fi:lb-2021101527)
- Reddit: r/Suomi submissions and comments (https://www.reddit.com/r/Suomi)
- STT: Finnish News Agency Archive 1992-2018 (http://urn.fi/urn:nbn:fi:lb-2019041501)
- Yle Finnish News Archive 2011-2018 (http://urn.fi/urn:nbn:fi:lb-2017070501)
- Yle Finnish News Archive 2019-2020 (http://urn.fi/urn:nbn:fi:lb-2021050401)
- Yle News Archive Easy-to-read Finnish 2011-2018 (http://urn.fi/urn:nbn:fi:lb-2019050901)
- Yle News Archive Easy-to-read Finnish 2019-2020 (http://urn.fi/urn:nbn:fi:lb-2021050701)
- ROOTS: TODO
Sampling Ratios
| Dataset | Chars | Ratio | Weight | W.Ratio |
|---------|-------|-------|--------|---------|
| Parsebank | 35.0B | 16.9% | 1.5 | 22.7% |
| mC4-Fi | 46.3B | 22.4% | 1.0 | 20.0% |
| CC-Fi | 79.6B | 38.5% | 1.0 | 34.4% |
| Fiwiki | 0.8B | 0.4% | 3.0 | 1.0% |
| Lönnrot | 0.8B | 0.4% | 3.0 | 1.0% |
| Yle | 1.6B | 0.8% | 2.0 | 1.4% |
| STT | 2.2B | 1.1% | 2.0 | 1.9% |
| ePub | 13.5B | 6.5% | 1.0 | 5.8% |
| Lehdet | 5.8B | 2.8% | 1.0 | 2.5% |
| Suomi24 | 20.6B | 9.9% | 1.0 | 8.9% |
| Reddit-Fi | 0.7B | 0.4% | 1.0 | 0.3% |
| TOTAL | 207.0B | 100.0% | N/A | 100.0% |
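The W.Ratio column appears consistent with multiplying each dataset's character count by its weight and renormalizing so the shares sum to 100%. The document does not spell out this formula, so the sketch below is an inferred reading of the table rather than the authors' sampling code; the function and variable names are illustrative.

```python
# Rows from the Sampling Ratios table: name -> (chars in billions, weight)
DATASETS = {
    "Parsebank": (35.0, 1.5),
    "mC4-Fi": (46.3, 1.0),
    "CC-Fi": (79.6, 1.0),
    "Fiwiki": (0.8, 3.0),
    "Lönnrot": (0.8, 3.0),
    "Yle": (1.6, 2.0),
    "STT": (2.2, 2.0),
    "ePub": (13.5, 1.0),
    "Lehdet": (5.8, 1.0),
    "Suomi24": (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}


def weighted_ratios(datasets):
    """Return each dataset's share after scaling its size by its weight."""
    weighted = {name: chars * weight for name, (chars, weight) in datasets.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


ratios = weighted_ratios(DATASETS)
for name, ratio in ratios.items():
    print(f"{name:<10} {ratio:6.1%}")
```

Under this reading, a weight above 1.0 upsamples a source: Parsebank, for example, moves from 16.9% of the raw characters to 22.7% of the weighted mix.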
Additional Information
More documentation and a paper are coming soon.
📄 License
The project is licensed under the Apache-2.0 license.