🚀 TurkuNLP Finnish GPT-3 Models
A family of Finnish text generation models based on the BLOOM architecture, offering various parameter sizes for different needs.
🚀 Quick Start
The TurkuNLP Finnish GPT-3 models are a family of pretrained monolingual GPT-style language models based on the BLOOM architecture, ranging from 186M to 13.3B parameters and trained for Finnish text generation. Note that these are pure language models, not instruction-finetuned for dialogue or question answering. They are designed to be used as foundation models, which can be further instruction-finetuned to serve as modern chat models.
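As a minimal sketch, the models can be loaded with the Hugging Face transformers library. The repository identifier below is an assumption; check the TurkuNLP organization on Hugging Face for the exact model names. Everything else is standard causal-LM usage.

```python
# Minimal generation sketch with Hugging Face transformers.
# "TurkuNLP/gpt3-finnish-large" is an assumed repository id; substitute
# the exact name from the TurkuNLP Hugging Face organization.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "TurkuNLP/gpt3-finnish-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Pure language model: give it text to continue, not instructions.
inputs = tokenizer("Suomessa on kaunis kesä, ja", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```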
✨ Features
- Multiple Sizes: Available in different parameter sizes (from 186M to 13.3B) to suit various computational and performance requirements.
- Finnish-Focused: Specifically trained for the Finnish language, using a combination of multiple Finnish resources.
- Large Training Data: All models are trained on 300B tokens.
📚 Documentation
📦 Model Parameters
| Model  | Layers | Dim  | Heads | Params |
|--------|--------|------|-------|--------|
| Small  | 12     | 768  | 12    | 186M   |
| Medium | 24     | 1024 | 16    | 437M   |
| Large  | 24     | 1536 | 16    | 881M   |
| XL     | 24     | 2064 | 24    | 1.5B   |
| "3B"   | 32     | 2560 | 32    | 2.8B   |
| "8B"   | 32     | 4096 | 32    | 7.5B   |
| "13B"  | 40     | 5120 | 40    | 13.3B  |
📊 Datasets
We used a combination of multiple Finnish resources for training:
- Finnish Internet Parsebank https://turkunlp.org/finnish_nlp.html
- mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/mc4
- Common Crawl Finnish https://TODO
- Finnish Wikipedia https://fi.wikipedia.org/wiki
- Lönnrot Projekti Lönnrot http://www.lonnrot.net/
- ePub National library "epub" collection
- National library "lehdet" collection
- Suomi24 The Suomi 24 Corpus 2001-2020 http://urn.fi/urn:nbn:fi:lb-2021101527
- Reddit r/Suomi submissions and comments https://www.reddit.com/r/Suomi
- STT Finnish News Agency Archive 1992-2018 http://urn.fi/urn:nbn:fi:lb-2019041501
- Yle Finnish News Archive 2011-2018 http://urn.fi/urn:nbn:fi:lb-2017070501
- Yle Finnish News Archive 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050401
- ROOTS TODO
📈 Sampling Ratios
| Dataset   | Chars  | Ratio  | Weight | Weighted ratio |
|-----------|--------|--------|--------|----------------|
| Parsebank | 35.0B  | 16.9%  | 1.5    | 22.7%          |
| mC4-Fi    | 46.3B  | 22.4%  | 1.0    | 20.0%          |
| CC-Fi     | 79.6B  | 38.5%  | 1.0    | 34.4%          |
| Fiwiki    | 0.8B   | 0.4%   | 3.0    | 1.0%           |
| Lönnrot   | 0.8B   | 0.4%   | 3.0    | 1.0%           |
| Yle       | 1.6B   | 0.8%   | 2.0    | 1.4%           |
| STT       | 2.2B   | 1.1%   | 2.0    | 1.9%           |
| ePub      | 13.5B  | 6.5%   | 1.0    | 5.8%           |
| Lehdet    | 5.8B   | 2.8%   | 1.0    | 2.5%           |
| Suomi24   | 20.6B  | 9.9%   | 1.0    | 8.9%           |
| Reddit-Fi | 0.7B   | 0.4%   | 1.0    | 0.3%           |
| TOTAL     | 207.0B | 100.0% | N/A    | 100.0%         |
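The weighted ratios follow directly from the character counts and sampling weights: each dataset's effective share is its character count times its weight, normalized over the weighted total. A small sketch reproducing the last column of the table:

```python
# Reproduce the "Weighted ratio" column: share = chars * weight / total.
datasets = {  # name: (chars in billions, sampling weight)
    "Parsebank": (35.0, 1.5),
    "mC4-Fi": (46.3, 1.0),
    "CC-Fi": (79.6, 1.0),
    "Fiwiki": (0.8, 3.0),
    "Lönnrot": (0.8, 3.0),
    "Yle": (1.6, 2.0),
    "STT": (2.2, 2.0),
    "ePub": (13.5, 1.0),
    "Lehdet": (5.8, 1.0),
    "Suomi24": (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}
total = sum(chars * weight for chars, weight in datasets.values())
for name, (chars, weight) in datasets.items():
    print(f"{name}: {chars * weight / total:.1%}")  # e.g. Parsebank: 22.7%
```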
⚠️ Important Note
The models are pure language models; they are not instruction-finetuned for dialogue or question answering.
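In practice this means prompts work best as text for the model to continue, not as direct questions. A hypothetical few-shot, completion-style prompt (illustrative only) could look like the following, fed to the generation code in the Quick Start above:

```python
# Few-shot, completion-style prompt for a pure language model: the model
# continues the established pattern instead of "answering" an instruction.
prompt = (
    "Käännös englanniksi:\n"  # "Translation into English:"
    "kissa -> cat\n"
    "koira -> dog\n"
    "hevonen ->"              # the model should continue with "horse"
)
```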
📄 License
This project is licensed under the Apache-2.0 license.
📢 More documentation and a paper are coming soon.