# 🚀 Greek (el) GPT2 model

This is a text generation (autoregressive) model based on the English GPT-2, fine-tuned for the Greek language, offering an efficient solution for Greek text generation.
## 🚀 Quick Start

The Greek (el) GPT2 model is a fine-tuned version of the English GPT-2 for the Greek language and can be used directly with the `transformers` text-generation pipeline:
```python
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek"

generator = pipeline(
    'text-generation',
    device=0,  # first GPU; use device=-1 to run on CPU
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"

print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
```
## ✨ Features

- Fine-tuned for Greek: Based on the English GPT-2 and fine-tuned for the Greek language, making it well suited for Greek text generation tasks.
- Efficient Training: Fine-tuned with gradual layer unfreezing, a more efficient and sustainable alternative to training from scratch, especially for low-resource languages.
## 📦 Installation

The code examples use the `transformers` library. You can install it with:

```bash
pip install transformers
```
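The `text-generation` pipeline also needs a deep-learning backend. Assuming you run the examples with PyTorch (the `device=0` argument implies a CUDA-enabled install), you can add it with:

```bash
pip install torch
```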
## 💻 Usage Examples

### Basic Usage
```python
from transformers import pipeline

model = "lighteternal/gpt2-finetuned-greek"

generator = pipeline(
    'text-generation',
    device=0,  # first GPU; use device=-1 to run on CPU
    model=model,
    tokenizer=model)

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"

print("\n".join([x.get("generated_text") for x in generator(
    text,
    max_length=len(text.split(" ")) + 15,
    do_sample=True,
    top_k=50,
    repetition_penalty=1.2,
    add_special_tokens=False,
    num_return_sequences=5,
    temperature=0.95,
    top_p=0.95)]))
```
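If you prefer not to use the `pipeline` wrapper, the same generation can be done by loading the tokenizer and model directly. The snippet below is a minimal sketch (not part of the original card) that assumes a PyTorch backend and roughly mirrors the sampling parameters used above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lighteternal/gpt2-finetuned-greek"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "Μια φορά κι έναν καιρό"  # "Once upon a time"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=inputs["input_ids"].shape[1] + 15,  # prompt length in tokens + 15
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.95,
    repetition_penalty=1.2,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)

for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```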
## 📚 Documentation

### Model Information

| Property | Details |
|----------|---------|
| Model Type | GPT2 (12-layer, 768-hidden, 12-heads, 117M parameters). OpenAI GPT-2 English model, fine-tuned for the Greek language |
| Training Data | ~23.4 GB of Greek corpora from CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices |
| Pre-processing | Tokenization + BPE segmentation |
| Metrics | Perplexity |
### Training data

We used a 23.4 GB sample from a consolidated Greek corpus (CC100, Wikimatrix, Tatoeba, Books, SETIMES and GlobalVoices) containing long sequences. This model is an improved version of our GPT-2 small model (https://huggingface.co/lighteternal/gpt2-finetuned-greek-small).
### Metrics

| Metric | Value |
|--------|-------|
| Train Loss | 3.67 |
| Validation Loss | 3.83 |
| Perplexity | 39.12 |
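For causal language models, perplexity is typically the exponential of the cross-entropy loss, so the reported value lines up with the training loss above (assuming perplexity was computed on the training split):

```python
import math

# Perplexity is commonly exp(cross-entropy loss):
# exp(3.67) ≈ 39.3, consistent with the reported perplexity of 39.12.
print(math.exp(3.67))
```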
### Acknowledgement

The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 50, 2nd call). Based on the work of Thomas Dehaene (ML6): https://blog.ml6.eu/dutch-gpt2-autoregressive-language-modelling-on-a-budget-cff3942dd020
## 🔧 Technical Details

A text generation (autoregressive) model built with Hugging Face transformers and fastai, based on the English GPT-2. It was fine-tuned with gradual layer unfreezing, a more efficient and sustainable alternative to training from scratch, especially for low-resource languages. Based on the work of Thomas Dehaene (ML6) for the creation of a Dutch GPT-2: https://colab.research.google.com/drive/1Y31tjMkB8TqKKFlZ5OJ9fcMp3p8suvs4?usp=sharing
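As a rough illustration of gradual layer unfreezing (a sketch of the general idea only; the original training used fastai's unfreezing utilities, and the schedule below is hypothetical):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # English GPT-2 base

def unfreeze_top_blocks(model, n):
    """Freeze all parameters, then unfreeze only the top n transformer blocks."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.transformer.h[-n:]:
        for p in block.parameters():
            p.requires_grad = True
    # (token embeddings / LM head are left frozen here for simplicity)

# Hypothetical schedule: start with only the top blocks trainable and
# progressively unfreeze deeper ones between training stages.
for n in (2, 4, 8, 12):
    unfreeze_top_blocks(model, n)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{n} blocks unfrozen -> {trainable / 1e6:.1f}M trainable parameters")
    # ... run one training stage here on the Greek corpus ...
```

Training progressively more layers keeps the early stages cheap while still letting the whole network adapt to Greek by the final stage.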
## 📄 License

This model is licensed under the Apache-2.0 license.