🚀 TurkuNLP Finnish GPT-3 Models
A family of Finnish text generation models based on the BLOOM architecture, offering various parameter sizes for different needs.
🚀 Quick Start
The TurkuNLP Finnish GPT-3 models are a family of pretrained monolingual GPT-style language models based on the BLOOM architecture, ranging from 186M to 13.3B parameters and trained for Finnish text generation. Note that these are pure language models, not instruction-finetuned for dialogue or question answering. They are designed to be used as foundation models, which can be further instruction-finetuned to serve as modern chat models.
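As a minimal sketch, the models can be loaded with the Hugging Face transformers library. The repository identifier below is an assumption; check the TurkuNLP organization on Hugging Face for the exact model names. Everything else is standard causal-LM usage.

```python
# Minimal generation sketch with Hugging Face transformers.
# "TurkuNLP/gpt3-finnish-large" is an assumed repository id; substitute
# the exact name from the TurkuNLP Hugging Face organization.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TurkuNLP/gpt3-finnish-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Pure language model: give it text to continue, not instructions.
inputs = tokenizer("Suomessa on kaunis kesä, ja", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```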
✨ Features
- Multiple Sizes: Available in different parameter sizes (from 186M to 13.3B) to suit various computational and performance requirements.
- Finnish-Focused: Specifically trained for the Finnish language, using a combination of multiple Finnish resources.
- Large Training Data: All models are trained on 300B tokens.
📚 Documentation
📦 Model Parameters
| Model  | Layers | Dim  | Heads | Params |
|--------|--------|------|-------|--------|
| Small  | 12     | 768  | 12    | 186M   |
| Medium | 24     | 1024 | 16    | 437M   |
| Large  | 24     | 1536 | 16    | 881M   |
| XL     | 24     | 2064 | 24    | 1.5B   |
| "3B"   | 32     | 2560 | 32    | 2.8B   |
| "8B"   | 32     | 4096 | 32    | 7.5B   |
| "13B"  | 40     | 5120 | 40    | 13.3B  |
📊 Datasets
We used a combination of multiple Finnish resources for training:
- Finnish Internet Parsebank https://turkunlp.org/finnish_nlp.html
- mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/mc4
- Common Crawl Finnish https://TODO
- Finnish Wikipedia https://fi.wikipedia.org/wiki
- Lönnrot Projekti Lönnrot http://www.lonnrot.net/
- ePub National library "epub" collection
- National library "lehdet" collection
- Suomi24 The Suomi 24 Corpus 2001-2020 http://urn.fi/urn:nbn:fi:lb-2021101527
- Reddit r/Suomi submissions and comments https://www.reddit.com/r/Suomi
- STT Finnish News Agency Archive 1992-2018 http://urn.fi/urn:nbn:fi:lb-2019041501
- Yle Finnish News Archive 2011-2018 http://urn.fi/urn:nbn:fi:lb-2017070501
- Yle Finnish News Archive 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050401
- ROOTS TODO
📈 Sampling Ratios
| Dataset   | Chars  | Ratio  | Weight | Weighted ratio |
|-----------|--------|--------|--------|----------------|
| Parsebank | 35.0B  | 16.9%  | 1.5    | 22.7%          |
| mC4-Fi    | 46.3B  | 22.4%  | 1.0    | 20.0%          |
| CC-Fi     | 79.6B  | 38.5%  | 1.0    | 34.4%          |
| Fiwiki    | 0.8B   | 0.4%   | 3.0    | 1.0%           |
| Lönnrot   | 0.8B   | 0.4%   | 3.0    | 1.0%           |
| Yle       | 1.6B   | 0.8%   | 2.0    | 1.4%           |
| STT       | 2.2B   | 1.1%   | 2.0    | 1.9%           |
| ePub      | 13.5B  | 6.5%   | 1.0    | 5.8%           |
| Lehdet    | 5.8B   | 2.8%   | 1.0    | 2.5%           |
| Suomi24   | 20.6B  | 9.9%   | 1.0    | 8.9%           |
| Reddit-Fi | 0.7B   | 0.4%   | 1.0    | 0.3%           |
| TOTAL     | 207.0B | 100.0% | N/A    | 100.0%         |
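The weighted ratios follow directly from the character counts and sampling weights: each dataset's effective share is its character count times its weight, normalized over the weighted total. A small sketch reproducing the last column of the table:

```python
# Reproduce the "Weighted ratio" column: share = chars * weight / total.
datasets = {  # name: (chars in billions, sampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi": (46.3, 1.0),
    "CC-Fi": (79.6, 1.0),
    "Fiwiki": (0.8, 3.0),
    "Lönnrot": (0.8, 3.0),
    "Yle": (1.6, 2.0),
    "STT": (2.2, 2.0),
    "ePub": (13.5, 1.0),
    "Lehdet": (5.8, 1.0),
    "Suomi24": (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}
total = sum(chars * weight for chars, weight in datasets.values())
for name, (chars, weight) in datasets.items():
    print(f"{name}: {chars * weight / total:.1%}")  # e.g. Parsebank: 22.7%
```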
⚠️ Important Note
The models are pure language models; they are not instruction-finetuned for dialogue or question answering.
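In practice this means prompts work best as text for the model to continue, not as direct questions. A hypothetical few-shot, completion-style prompt (illustrative only) could look like the following, fed to the generation code in the Quick Start above:

```python
# Few-shot, completion-style prompt for a pure language model: the model
# continues the established pattern instead of "answering" an instruction.
prompt = (
    "Käännös englanniksi:\n"  # "Translation into English:"
    "kissa -> cat\n"
    "koira -> dog\n"
    "hevonen ->"              # the model should continue with "horse"
)
```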
📄 License
This project is licensed under the Apache-2.0 license.
📢 More documentation and a paper are coming soon.