🚀 Slovak GPT-J-1.4B
Slovak GPT-J-1.4B, boasting 1,415,283,792 parameters, is the latest and largest model in the Slovak GPT-J series. Smaller variants, Slovak GPT-J-405M and Slovak GPT-J-162M, are still available.
✨ Features
- Based on GPT-J, with over 1.4B trainable parameters.
- Trained on a privately collected dataset with over 40GB of Slovak text from various categories.
- Can be used for text generation from a prompt, similar to the original GPT-J.
📦 Installation
No specific installation steps are provided in the original README.
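The usage examples below assume the standard Hugging Face stack; a minimal setup (an assumption, not an official requirement list) would be:
pip install transformers torch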
💻 Usage Examples
Basic Usage
This model, along with its tokenizer, can be easily loaded using the AutoModelForCausalLM functionality:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-1.4B")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-1.4B")
When generating a prompt, keep in mind these three things (see the encoding sketch after this list):
- Never leave trailing whitespace. There's a difference between how the tokenizer encodes "Mám rád slovenčinu" (no trailing space) and "Mám rád slovenčinu " (trailing space), i.e. [12805, 2872, 46878] != [12805, 2872, 46878, 221].
- Always use good ol' US English primary double quotation marks, i.e. "" instead of „“.
- In case of a new line, always enter \n\n instead of a single \n.
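The following is a minimal, self-contained sketch that makes the trailing-whitespace difference visible; the expected token IDs are the ones quoted in the list above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-1.4B")

# Trailing whitespace changes the encoding: the second prompt yields one extra token.
print(tokenizer.encode("Mám rád slovenčinu"))   # [12805, 2872, 46878]
print(tokenizer.encode("Mám rád slovenčinu "))  # [12805, 2872, 46878, 221]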
To illustrate basic text generation:
prompt = "Tradičné jedlo na Orave sú"
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model.generate(**encoded_input)
print(tokenizer.decode(output[0]))
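Beyond the default greedy call above, generate accepts the usual decoding arguments. The sketch below assumes typical sampling settings; the specific values are illustrative, not recommendations from the model card.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-1.4B")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-1.4B")

prompt = "Tradičné jedlo na Orave sú"
encoded_input = tokenizer(prompt, return_tensors='pt')

# Sampling parameters here are illustrative assumptions, not values from the model card.
output = model.generate(
    **encoded_input,
    max_length=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(output[0]))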
Advanced Usage
Based on the parameter count and internal testing, this is the most capable model in the Slovak GPT-J series. Here are some examples of different prompts and their outputs:
A spin on "Mor ho!" by Samo Chalupka
PROMPT
Zleteli orly z Tatry, tiahnu na podolia, ponad vysoké hory, ponad rovné polia; preleteli cez Dunaj, cez tú šíru vodu, sadli tam za pomedzím slovenského rodu.
Duní Dunaj a luna za lunou sa valí
OUTPUT
Zleteli orly z Tatry, tiahnu na podolia, ponad vysoké hory, ponad rovné polia; preleteli cez Dunaj, cez tú šíru vodu, sadli tam za pomedzím slovenského rodu.
Duní Dunaj a luna za lunou sa valí: nad ním svieti pevný hrad na vysokom bralí. Pod tým hradom je zem slovenská, na nej je Slovensko naše.
Na tom vrchu leží slovenský národ, pod tým bralom leží krajina slovenská. A pod tým hradom leží i jeho sláva: na tom vrchu, na tom brale leží sláva slovenského národa.
Jokes
PROMPT
Nasledujú najlepšie vtipy.
Prečo
OUTPUT
Nasledujú najlepšie vtipy.
Prečo má policajt pri sebe vždy dva nože? Jeden je na sebaobranu, druhý na boj zblízka.
The infamous unicorn prompt from the GPT-2 paper
PROMPT
V šokujúcom náleze vedci objavili stádo jednorožcov žijúcich v odľahlom, predtým nepreskúmanom údolí, v Andách. Ešte prekvapivejšie pre vedcov bol fakt, že jednorožce hovorili plynule po slovensky.
OUTPUT
V šokujúcom náleze vedci objavili stádo jednorožcov žijúcich v odľahlom, predtým nepreskúmanom údolí, v Andách. Ešte prekvapivejšie pre vedcov bol fakt, že jednorožce hovorili plynule po slovensky.
V Andách ich totiž žilo približne 700 jedincov, ktoré sa živili výhradne materským mliekom. Slováci sú tak pravdepodobne prvými Európanmi, ktorí v Andách stretli jednorožca. "Je to dôkaz, že tieto zvieratá sú inteligentné a že žijú v Andách už stovky rokov," povedal pre webový portál televízie JOJ profesor geológie, geografie a zoológie, Milan Kováč.
Podľa profesora Kováča si v Andách zvieratá vytvárajú svoj vlastný jazyk. Je to zároveň dôkaz, že jednorožce žili v minulosti aj v slovenských pohoriach. "Jednorožce sa tam síce vyskytovali, ale neboli tak dobre preskúmané, ako teraz v Andách."
Na Slovensku však ľudia o jednorožcoch donedávna vedeli veľmi málo.<|endoftext|>
Prompt in Czech
prompt = "Věta nesmí být sprostá a musí být zcela"
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model.generate(**encoded_input, max_length=16)
print(tokenizer.decode(output[0]))
📚 Documentation
Model Description
| Property | Details |
|---|---|
| Model Type | Based on GPT-J |
| Tokenizer | ByteLevelBPETokenizer† |
| Training Data | A privately collected dataset with over 40GB of Slovak text from various categories |

† ByteLevelBPETokenizer was trained on the same Slovak corpus.
Training data
Slovak GPT-J models were trained on a privately collected dataset consisting of predominantly Slovak text spanning different categories, e.g. web, news articles or even biblical texts. In total, over 40GB of text data was used to train this model.
The dataset was preprocessed and cleaned in a specific way that introduces a few minor caveats. To achieve the expected performance, refer to the [How to use] section. Keep in mind that despite the effort to remove inappropriate content from the corpus, the model might still generate sensitive content or leak sensitive information.
Training procedure
This model was trained for a bit more than 26.5 billion tokens over 48,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 2.657.
Intended Use
Like the original GPT-J, Slovak GPT-J learns an inner representation of the language that can be used to extract features useful for downstream tasks. However, its intended use is text generation from a prompt.
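As an illustration of the feature-extraction use mentioned above, here is a minimal sketch assuming the standard transformers hidden-state API; the mean-pooling step is an illustrative choice, not a recipe from the model card.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-1.4B")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-1.4B")

encoded_input = tokenizer("Mám rád slovenčinu", return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded_input, output_hidden_states=True)

# Mean-pool the last hidden layer into a single sentence-level feature vector.
features = outputs.hidden_states[-1].mean(dim=1)
print(features.shape)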
How to use
See the [Usage Examples] section for details.
Capabilities, Limitations, and Biases
The model can generate various types of text, but since the dataset contains profanity, politically incorrect language, and (unintentionally) some Czech text, the model may generate such content to some extent as well.
🔧 Technical Details
This model was trained for a bit more than 26.5 billion tokens over 48,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 2.657.
📄 License
This model is licensed under the GPL-3.0 license.
📖 Citation and Related Information
This was done as a moonlighting project during the summer of 2021 to better understand transformers. If you use this model or have any questions about it, feel free to contact the author on Twitter or check out his GitHub profile.
BibTeX entry
To cite this model:
@misc{slovak-gpt-j-1.4B,
author = {Kondela, Milos},
title = {{Slovak GPT-J-1.4B}},
howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-1.4B}},
year = 2022,
month = February
}
To cite the codebase that trained this model:
@misc{mesh-transformer-jax,
author = {Wang, Ben},
title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
year = 2021,
month = May
}
🙏 Acknowledgements
This project was generously supported by the TPU Research Cloud (TRC) program. Special thanks also go to Ben Wang and the great EleutherAI community.
⚠️ Important Note
Despite the effort to remove inappropriate content from the corpus, the model might still generate sensitive content or leak sensitive information.
💡 Usage Tip
To achieve the expected performance, refer to the [How to use] section.

