🚀 Slovak GPT-J-162M
Slovak GPT-J-162M is the first model in the Slovak GPT-J series. It's the first publicly available transformer mainly trained on a Slovak corpus. After its initial release, two other models were made public: Slovak GPT-J-405M and the largest Slovak GPT-J-1.4B.
✨ Features
- Based on GPT-J with over 162M trainable parameters.
📚 Documentation
Model Description
The model is based on GPT-J and has more than 162 million trainable parameters.
| Property | Details |
|---|---|
| Model Type | Based on GPT-J |
| Training Data | A privately collected dataset consisting mainly of Slovak text from different categories, with over 40GB of text data in total. |

† ByteLevelBPETokenizer was trained on the same Slovak corpus.
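For readers curious how such a tokenizer is produced, here is a minimal sketch using the `tokenizers` library; the corpus file name and vocabulary size are illustrative placeholders, not the settings actually used for this model.

```python
# Hypothetical sketch of training a ByteLevelBPETokenizer on a Slovak corpus.
# The file path and vocab_size are placeholders, not the original configuration.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["slovak_corpus.txt"],   # placeholder path to the training text
    vocab_size=50256,              # illustrative size, comparable to GPT-2/GPT-J vocabularies
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("slovak-tokenizer")  # writes vocab.json and merges.txt
```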
Training Data
Slovak GPT-J-162M was trained on a privately collected dataset consisting mainly of Slovak text from various categories, such as web content, news articles, and even biblical texts; in total, over 40GB of text data was used to train this model. The dataset was preprocessed and cleaned in a specific way that introduces some minor caveats, so to get the expected performance, refer to the Usage Examples section below. Note that despite efforts to remove inappropriate parts of the corpus, the model may still generate sensitive content or leak sensitive information.
Training Procedure
This model was trained for nearly 37 billion tokens over 69,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 3.065.
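For intuition, the cross-entropy loss can be converted into a per-token perplexity; this conversion is my addition, not a figure reported with the model.

```python
import math

# Perplexity is the exponential of the cross-entropy loss (measured in nats).
val_loss = 3.065
print(f"validation perplexity ≈ {math.exp(val_loss):.1f}")  # ≈ 21.4
```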
Intended Use
Similar to the original GPT-J, Slovak GPT-J learns an internal language representation that can be used to extract features for downstream tasks. However, its main purpose is text generation from a prompt.
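As a rough illustration of feature extraction, the sketch below mean-pools the last hidden layer into a sentence vector; the pooling strategy is my own choice for the example, not a recipe from the model card.

```python
# Minimal feature-extraction sketch: read out hidden states instead of generated text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-162M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-162M")

inputs = tokenizer("Mám rád slovenčinu", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states holds one tensor per layer, each of shape (batch, seq_len, hidden_size);
# mean-pool the last layer over tokens to get a single sentence vector.
sentence_embedding = outputs.hidden_states[-1].mean(dim=1)
print(sentence_embedding.shape)
```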
💻 Usage Examples
Basic Usage
This model and its tokenizer can be easily loaded using the `AutoModelForCausalLM` functionality:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-162M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-162M")
```
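If you prefer a one-liner, the `text-generation` pipeline wraps tokenization, generation, and decoding in a single call; this is a convenience sketch, and the `max_length` value is illustrative rather than a recommendation from the model card.

```python
from transformers import pipeline

# Convenience alternative to the explicit tokenizer/model calls above.
generator = pipeline("text-generation", model="Milos/slovak-gpt-j-162M")
print(generator("Mám rád slovenčinu", max_length=32)[0]["generated_text"])
```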
When generating a prompt, keep these three points in mind:
- Never leave trailing whitespace. The tokenizer encodes "Mám rád slovenčinu" (no space after `slovenčinu`) differently from "Mám rád slovenčinu " (trailing space after `slovenčinu`), i.e. `[12805, 2872, 46878]` != `[12805, 2872, 46878, 221]` (see the quick check after this list).
- Always use standard US English double quotation marks, i.e. `""` instead of `„“`.
- For a new line, always enter `\n\n` instead of a single `\n`.
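The trailing-whitespace caveat is easy to verify with the tokenizer loaded above; the expected token IDs are the ones listed in the first point.

```python
# Quick check of the trailing-whitespace caveat, reusing `tokenizer` from above.
ids_no_space = tokenizer("Mám rád slovenčinu")["input_ids"]
ids_space = tokenizer("Mám rád slovenčinu ")["input_ids"]
print(ids_no_space)  # expected: [12805, 2872, 46878]
print(ids_space)     # expected: [12805, 2872, 46878, 221]
assert ids_no_space != ids_space
```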
To illustrate basic text generation:
```python
>>> prompt = "Moje najobľubenejšie mesto na severe Slovenska je"  # "My favorite city in the north of Slovakia is"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input)
>>> tokenizer.decode(output[0])
'Moje najobľubenejšie mesto na severe Slovenska je Žilina.\n\nV Žiline sa nachádza množstvo zaujímavých miest'
```
(English: "My favorite city in the north of Slovakia is Žilina. Žilina has a number of interesting places".)
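Greedy decoding, the default above, tends to be repetitive; a sampling-based call such as the sketch below often gives more varied output. The parameter values are illustrative, not tuned recommendations from the model card.

```python
# Sampling-based generation; reuses `model`, `tokenizer`, and `encoded_input` from above.
# All parameter values here are illustrative, not recommendations from the card.
output = model.generate(
    **encoded_input,
    max_length=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
)
print(tokenizer.decode(output[0]))
```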
Advanced Usage
The model's capabilities are limited by its relatively small size of only 162M parameters; its main purpose is education and fun. Since the dataset contains profanity, politically incorrect language, and even some Czech text, the model can generate such content as well to some extent. Here's an example with a Czech prompt:
```python
>>> prompt = "Věta nesmí být sprostá a musí být zcela"  # Czech: "The sentence must not be vulgar and must be entirely"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input, max_length=16)
>>> tokenizer.decode(output[0])
'Věta nesmí být sprostá a musí být zcela věrná.'
```
(English: "The sentence must not be vulgar and must be entirely faithful.")
Citation and Related Information
This was a side project in the summer of 2021, started to better understand transformers; due to limited free time, it wasn't properly open-sourced until now. Depending on the model's popularity and interest, more capable and substantially larger Slovak GPT-J models may be released.
If you use this model or have questions about it, contact me on Twitter or check my GitHub profile.
BibTeX entry
To cite this model:
```bibtex
@misc{slovak-gpt-j-162m,
  author = {Kondela, Milos},
  title = {{Slovak GPT-J-162M}},
  howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-162M}},
  year = 2022,
  month = February
}
```
To cite the codebase that trained this model:
```bibtex
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```
📄 License
This model is released under the GPL-3.0 license.
Acknowledgements
This project was generously supported by the TPU Research Cloud (TRC) program. Special thanks to Ben Wang and the great EleutherAI community.

