Mid-sized model in the Slovak GPT-J series, primarily for Slovak text generation tasks
Model Features
Large-scale Slovak training
Trained on over 40GB of diverse Slovak text
Rotary position embedding
Uses rotary position embedding (RoPE) to improve handling of long text
Optimized tokenization
ByteLevelBPETokenizer optimized for the Slovak language
Model Capabilities
Slovak text generation
Language feature extraction
Prompt-based content creation
Use Cases
Content generation
Article writing
Generate coherent Slovak articles based on topic prompts
Can produce grammatically correct guide-style articles
Dialogue simulation
Generate Slovak dialogue content
May require parameter adjustment to mitigate repetition issues
Educational assistance
Language learning
Generate Slovak language learning materials
🚀 Slovak GPT-J-405M
Slovak GPT-J-405M is the second model in the Slovak GPT-J series, following its smaller variant Slovak GPT-J-162M. Subsequently, a larger Slovak GPT-J-1.4B was released. It offers enhanced language processing capabilities for Slovak text.
🚀 Quick Start
This model along with the tokenizer can be easily loaded using the AutoModelForCausalLM functionality:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")
✨ Features
Based on GPT-J: The model is based on GPT-J and has over 405M trainable parameters.
Text Generation: Similar to the original GPT-J, it can be used to generate text from a prompt, and it learns an inner representation of the language that is useful for downstream tasks.
📚 Documentation
Model Description
The model is based on GPT-J and has over 405M trainable parameters. The ByteLevelBPETokenizer was trained on the same Slovak corpus.
Training data
Slovak GPT-J models were trained on a privately collected dataset consisting of predominantly Slovak text spanning different categories, e.g. web, news articles or even biblical texts; in total, over 40GB of text data was used to train this model.
The dataset was preprocessed and cleaned in a specific way that involves a few minor caveats, so in order to achieve the expected performance, refer to the How to use section. Please keep in mind that despite the effort to remove inappropriate content from the corpus, the model might still generate sensitive content or leak sensitive information.
Training procedure
This model was trained for a bit more than 36.5 billion tokens over 69,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 2.821.
Intended Use
Same as the original GPT-J, Slovak GPT-J learns an inner representation of the language that can be used to extract features useful for downstream tasks; however, the intended use is text generation from a prompt.
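Besides generation, the final-layer hidden states can serve as contextual features. A minimal sketch, assuming the model and tokenizer loaded in the Quick Start section above (output_hidden_states is standard transformers functionality, not specific to this model):

import torch

text = "Mám rád slovenčinu"
encoded_input = tokenizer(text, return_tensors='pt')

# Forward pass that also returns the hidden states of every layer
with torch.no_grad():
    outputs = model(**encoded_input, output_hidden_states=True)

# hidden_states is a tuple (embedding layer + one tensor per transformer layer);
# the last entry holds the final-layer representation of each token
features = outputs.hidden_states[-1]  # shape: (1, sequence_length, hidden_size)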
How to use
When generating a prompt, keep these three things in mind:
Never leave trailing whitespace. There's a difference between how the tokenizer encodes "Mám rád slovenčinu" (no trailing space) and "Mám rád slovenčinu " (trailing space after slovenčinu), i.e. [12805, 2872, 46878] != [12805, 2872, 46878, 221] (see the sketch after this list).
Always use good ol' US English primary double quotation marks, i.e. "" instead of „“.
In case of a new line, always enter \n\n instead of a single \n.
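A minimal sketch of the first point, using the tokenizer loaded above, to show that a trailing space changes the encoding (the token ids are the ones quoted above):

ids_no_space = tokenizer("Mám rád slovenčinu")["input_ids"]
ids_trailing = tokenizer("Mám rád slovenčinu ")["input_ids"]

print(ids_no_space)  # [12805, 2872, 46878]
print(ids_trailing)  # [12805, 2872, 46878, 221]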
To illustrate an example of basic text generation:
>>> prompt = "Tradičné jedlo na Orave sú"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input)
>>> tokenizer.decode(output[0])
'Tradičné jedlo na Orave sú bryndzové halušky\n\nNa Orave sa v minulosti varilo viac druhov'
Capabilities, Limitations, and Biases
The model can generate interesting and grammatically correct content with relative ease, despite having only 405M parameters. However, relying on it to produce factually correct information isn't recommended.
GPT models can (and often will) get into a repeating cycle of generating the same content. See the documentation of generate for how to introduce a frequency/repetition penalty, as sketched below.
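A minimal sketch of passing a repetition penalty to generate, reusing the prompt from the example above; the sampling values are illustrative and not tuned for this model:

encoded_input = tokenizer("Tradičné jedlo na Orave sú", return_tensors='pt')

# repetition_penalty > 1.0 down-weights tokens that have already appeared;
# no_repeat_ngram_size blocks exact repetition of n-grams of that size
output = model.generate(
    **encoded_input,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(output[0]))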
Since the dataset contains profanity, politically incorrect language, and (unintentionally) even some text in Czech, the model can generate such content to some extent.
Citation and Related Information
This was done as a moonlighting project during the summer of 2021 to better understand transformers. If you use this model or have any questions about it, feel free to contact the author on Twitter or check out the author's GitHub profile.
BibTeX entry
To cite this model:
@misc{slovak-gpt-j-405m,
  author = {Kondela, Milos},
  title = {{Slovak GPT-J-405M}},
  howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-405M}},
  year = 2022,
  month = February
}
To cite the codebase that trained this model:
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}