🚀 Slovak GPT-J-162M
Slovak GPT-J-162M is the first model in the Slovak GPT-J series. It's the first publicly available transformer mainly trained on a Slovak corpus. After its initial release, two other models were made public: Slovak GPT-J-405M and the largest Slovak GPT-J-1.4B.
✨ Features
- Based on GPT-J with over 162M trainable parameters.
📚 Documentation
Model Description
The model is based on GPT-J and has more than 162 million trainable parameters.
| Property | Details |
|---|---|
| Model Type | Based on GPT-J |
| Training Data | A privately collected dataset consisting mainly of Slovak text from different categories, with over 40GB of text data in total. |

† ByteLevelBPETokenizer was trained on the same Slovak corpus.
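For readers curious how such a tokenizer is produced, here is a minimal sketch using the `tokenizers` library; the corpus file name and vocabulary size are illustrative placeholders, not the settings actually used for this model.

```python
# Hypothetical sketch of training a ByteLevelBPETokenizer on a Slovak corpus.
# The file path and vocab_size are placeholders, not the original configuration.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["slovak_corpus.txt"],   # placeholder path to the training text
    vocab_size=50256,              # illustrative size, comparable to GPT-2/GPT-J vocabularies
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("slovak-tokenizer")  # writes vocab.json and merges.txt
```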
Training Data
Slovak GPT-J-162M was trained on a privately collected dataset consisting mainly of Slovak text from various categories, such as web content, news articles, and even biblical texts; in total, over 40GB of text data was used to train this model. The dataset was preprocessed and cleaned in a specific way that introduces some minor caveats, so to get the expected performance, refer to the Usage Examples section below. Note that despite efforts to remove inappropriate parts of the corpus, the model may still generate sensitive content or leak sensitive information.
Training Procedure
This model was trained for nearly 37 billion tokens over 69,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 3.065.
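For intuition, the cross-entropy loss can be converted into a per-token perplexity; this conversion is my addition, not a figure reported with the model.

```python
import math

# Perplexity is the exponential of the cross-entropy loss (measured in nats).
val_loss = 3.065
print(f"validation perplexity ≈ {math.exp(val_loss):.1f}")  # ≈ 21.4
```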
Intended Use
Similar to the original GPT-J, Slovak GPT-J learns an internal language representation that can be used to extract features for downstream tasks. However, its main purpose is text generation from a prompt.
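As a rough illustration of feature extraction, the sketch below mean-pools the last hidden layer into a sentence vector; the pooling strategy is my own choice for the example, not a recipe from the model card.

```python
# Minimal feature-extraction sketch: read out hidden states instead of generated text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-162M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-162M")

inputs = tokenizer("Mám rád slovenčinu", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states holds one tensor per layer, each of shape (batch, seq_len, hidden_size);
# mean-pool the last layer over tokens to get a single sentence vector.
sentence_embedding = outputs.hidden_states[-1].mean(dim=1)
print(sentence_embedding.shape)
```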
💻 Usage Examples
Basic Usage
This model and its tokenizer can be easily loaded using the `AutoModelForCausalLM` functionality:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-162M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-162M")
```
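If you prefer a one-liner, the `text-generation` pipeline wraps tokenization, generation, and decoding in a single call; this is a convenience sketch, and the `max_length` value is illustrative rather than a recommendation from the model card.

```python
from transformers import pipeline

# Convenience alternative to the explicit tokenizer/model calls above.
generator = pipeline("text-generation", model="Milos/slovak-gpt-j-162M")
print(generator("Mám rád slovenčinu", max_length=32)[0]["generated_text"])
```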
When generating a prompt, keep these three points in mind:
- Never leave trailing whitespace. The tokenizer encodes "Mám rád slovenčinu" (no space after `slovenčinu`) differently from "Mám rád slovenčinu " (trailing space after `slovenčinu`), i.e. `[12805, 2872, 46878]` != `[12805, 2872, 46878, 221]` (see the quick check after this list).
- Always use standard US English double quotation marks, i.e. `""` instead of `„“`.
- For a new line, always enter `\n\n` instead of a single `\n`.
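The trailing-whitespace caveat is easy to verify with the tokenizer loaded above; the expected token IDs are the ones listed in the first point.

```python
# Quick check of the trailing-whitespace caveat, reusing `tokenizer` from above.
ids_no_space = tokenizer("Mám rád slovenčinu")["input_ids"]
ids_space = tokenizer("Mám rád slovenčinu ")["input_ids"]
print(ids_no_space)  # expected: [12805, 2872, 46878]
print(ids_space)     # expected: [12805, 2872, 46878, 221]
assert ids_no_space != ids_space
```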
To illustrate basic text generation:
```python
>>> prompt = "Moje najobľubenejšie mesto na severe Slovenska je"  # "My favorite city in the north of Slovakia is"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input)
>>> tokenizer.decode(output[0])
'Moje najobľubenejšie mesto na severe Slovenska je Žilina.\n\nV Žiline sa nachádza množstvo zaujímavých miest'
```
(English: "My favorite city in the north of Slovakia is Žilina. Žilina has a number of interesting places".)
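Greedy decoding, the default above, tends to be repetitive; a sampling-based call such as the sketch below often gives more varied output. The parameter values are illustrative, not tuned recommendations from the model card.

```python
# Sampling-based generation; reuses `model`, `tokenizer`, and `encoded_input` from above.
# All parameter values here are illustrative, not recommendations from the card.
output = model.generate(
    **encoded_input,
    max_length=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
)
print(tokenizer.decode(output[0]))
```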
Advanced Usage
The model's capabilities are limited by its relatively small size of only 162M parameters; its main purpose is education and fun. Since the dataset contains profanity, politically incorrect language, and even some Czech text, the model can generate such content as well to some extent. Here's an example with a Czech prompt:
```python
>>> prompt = "Věta nesmí být sprostá a musí být zcela"  # Czech: "The sentence must not be vulgar and must be entirely"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input, max_length=16)
>>> tokenizer.decode(output[0])
'Věta nesmí být sprostá a musí být zcela věrná.'
```
(English: "The sentence must not be vulgar and must be entirely faithful.")
Citation and Related Information
This was a side project in the summer of 2021, started to better understand transformers; due to limited free time, it wasn't properly open-sourced until now. Depending on the model's popularity and interest, more capable and substantially larger Slovak GPT-J models may be released.
If you use this model or have questions about it, contact me on Twitter or check my GitHub profile.
BibTeX entry
To cite this model:
```bibtex
@misc{slovak-gpt-j-162m,
  author = {Kondela, Milos},
  title = {{Slovak GPT-J-162M}},
  howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-162M}},
  year = 2022,
  month = February
}
```
To cite the codebase that trained this model:
```bibtex
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```
📄 License
This model is released under the GPL-3.0 license.
Acknowledgements
This project was generously supported by the TPU Research Cloud (TRC) program. Special thanks to Ben Wang and the great EleutherAI community.

