Mid-sized model in the Slovak GPT-J series, primarily for Slovak text generation tasks
Model Features
Large-scale Slovak training
Trained on over 40GB of diverse Slovak text
Rotary position embedding
Uses rotary position embedding (RoPE) to improve handling of long text
Optimized tokenization
ByteLevelBPETokenizer optimized for the Slovak language
Model Capabilities
Slovak text generation
Language feature extraction
Prompt-based content creation
Use Cases
Content generation
Article writing
Generate coherent Slovak articles based on topic prompts
Can produce grammatically correct guide-style articles
Dialogue simulation
Generate Slovak dialogue content
May require parameter adjustment to mitigate repetition issues
Educational assistance
Language learning
Generate Slovak language learning materials
🚀 Slovak GPT-J-405M
Slovak GPT-J-405M is the second model in the Slovak GPT-J series, following its smaller variant Slovak GPT-J-162M. Subsequently, a larger Slovak GPT-J-1.4B was released. It offers enhanced language processing capabilities for Slovak text.
🚀 Quick Start
This model along with the tokenizer can be easily loaded using the AutoModelForCausalLM functionality:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")
✨ Features
Based on GPT-J: The model is based on GPT-J and has over 405M trainable parameters.
Text Generation: Similar to the original GPT-J, it can be used to generate text from a prompt, and it learns an inner representation of the language that is useful for downstream tasks.
📚 Documentation
Model Description
The model is based on GPT-J and has over 405M trainable parameters. The ByteLevelBPETokenizer was trained on the same Slovak corpus.
Training data
Slovak GPT-J models were trained on a privately collected dataset consisting of predominantly Slovak text spanning different categories, e.g. web, news articles or even biblical texts; in total, over 40GB of text data was used to train this model.
The dataset was preprocessed and cleaned in a specific way that involves a few minor caveats, so in order to achieve the expected performance, refer to the How to use section. Please keep in mind that despite the effort to remove inappropriate content from the corpus, the model might still generate sensitive content or leak sensitive information.
Training procedure
This model was trained for a bit more than 36.5 billion tokens over 69,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 2.821.
Intended Use
Same as the original GPT-J, Slovak GPT-J learns an inner representation of the language that can be used to extract features useful for downstream tasks; however, the intended use is text generation from a prompt.
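Besides generation, the final-layer hidden states can serve as contextual features. A minimal sketch, assuming the model and tokenizer loaded in the Quick Start section above (output_hidden_states is standard transformers functionality, not specific to this model):

import torch

text = "Mám rád slovenčinu"
encoded_input = tokenizer(text, return_tensors='pt')

# Forward pass that also returns the hidden states of every layer
with torch.no_grad():
    outputs = model(**encoded_input, output_hidden_states=True)

# hidden_states is a tuple (embedding layer + one tensor per transformer layer);
# the last entry holds the final-layer representation of each token
features = outputs.hidden_states[-1]  # shape: (1, sequence_length, hidden_size)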
How to use
When generating a prompt, keep these three things in mind:
Never leave trailing whitespace. There's a difference between how the tokenizer encodes "Mám rád slovenčinu" (no trailing space) and "Mám rád slovenčinu " (trailing space after slovenčinu), i.e. [12805, 2872, 46878] != [12805, 2872, 46878, 221] (see the sketch after this list).
Always use good ol' US English primary double quotation marks, i.e. "" instead of „“.
In case of a new line, always enter \n\n instead of a single \n.
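A minimal sketch of the first point, using the tokenizer loaded above, to show that a trailing space changes the encoding (the token ids are the ones quoted above):

ids_no_space = tokenizer("Mám rád slovenčinu")["input_ids"]
ids_trailing = tokenizer("Mám rád slovenčinu ")["input_ids"]

print(ids_no_space)  # [12805, 2872, 46878]
print(ids_trailing)  # [12805, 2872, 46878, 221]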
To illustrate an example of basic text generation:
>>> prompt = "Tradičné jedlo na Orave sú"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input)
>>> tokenizer.decode(output[0])
'Tradičné jedlo na Orave sú bryndzové halušky\n\nNa Orave sa v minulosti varilo viac druhov'
Capabilities, Limitations, and Biases
The model can generate interesting and grammatically correct content with relative ease, despite having only 405M parameters. However, relying on it to produce factually correct information isn't recommended.
GPT models can (and often will) get into a repeating cycle of generating the same content. See the documentation of generate for how to introduce a frequency/repetition penalty, as sketched below.
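A minimal sketch of passing a repetition penalty to generate, reusing the prompt from the example above; the sampling values are illustrative and not tuned for this model:

encoded_input = tokenizer("Tradičné jedlo na Orave sú", return_tensors='pt')

# repetition_penalty > 1.0 down-weights tokens that have already appeared;
# no_repeat_ngram_size blocks exact repetition of n-grams of that size
output = model.generate(
    **encoded_input,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(output[0]))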
Since the dataset contains profanity, politically incorrect language, and (unintentionally) even some text in Czech, the model can generate such content to some extent.
Citation and Related Information
This was done as a moonlighting project during the summer of 2021 to better understand transformers. If you use this model or have any questions about it, feel free to contact the author on Twitter or check out the author's GitHub profile.
BibTeX entry
To cite this model:
@misc{slovak-gpt-j-405m,
  author = {Kondela, Milos},
  title = {{Slovak GPT-J-405M}},
  howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-405M}},
  year = 2022,
  month = February
}
To cite the codebase that trained this model:
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}