🚀 Turkish AI Writer based on GPT2-Small
This is an AI writer in Turkish based on the GPT2-small model, offering enhanced text generation capabilities.
🚀 Quick Start
Installation
Install the `transformers` and `torch` packages (e.g. `pip install transformers torch`), then load the tokenizer and model:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

# Note: AutoModelWithLMHead is deprecated in recent transformers releases;
# AutoModelForCausalLM can be used as a drop-in replacement for this model.
tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Cap the tokenizer at GPT-2's 1024-token context window
tokenizer.model_max_length = 1024

# Disable dropout for inference (leave in train mode to finetune)
model.eval()
```
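If a GPU is available, the model can optionally be moved to it (not required; the demo site runs CPU inference):

```python
# Optional: run on GPU if one is available.
# Any inputs built with the tokenizer must be moved to the same device,
# e.g. inputs = inputs.to(device)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```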
Generate 1 word
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
print('input text:', text)
print('predicted text:', predicted_text)
Generate Full Sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
sample_outputs = model.generate(inputs.input_ids,
pad_token_id=50256,
do_sample=True,
max_length=50,
top_k=40,
num_return_sequences=1)
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
✨ Features
- This model is an enhanced version of the finetuned gpt2-small-turkish.
- It is trained with the 28-10-2020 Wikipedia Turkish article dump and more than 400 classic Turkish novels and plays (including works by Dostoyevski, Shakespeare, Dumas).
- The work builds on Pierre Guillou's tutorial (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb).
- Since Turkish is not as close to English as Portuguese, the last 3 layers are trained instead of the last 2.
- The code is converted to work with Fastai 2.X and trained using Google Colab.
- The current accuracy is 36.3%, and the perplexity is 44.75.
- A demo (using CPU inference) is available at http://www.metayazar.com.
- Models are available at gpt2-small-tuned-tr and gpt2-small-turkish-writer.
📚 Documentation
Model description
This model is an enhanced version of the finetuned gpt2-small-turkish. In addition to the 28-10-2020 Wikipedia Turkish article dump, it is trained with more than 400 classic Turkish novels and plays (including works by Dostoyevski, Shakespeare, Dumas).
The work builds on Pierre Guillou's tutorial (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb).
Note that since Turkish is not as close to English as Portuguese, instead of training the last 2 layers, the last 3 layers are trained.
The code is converted to work with Fastai 2.X and trained using Google Colab.
The current accuracy is 36.3%, and the perplexity is 44.75.
A demo (using CPU inference) is available at http://www.metayazar.com.
Models are available at:
- gpt2-small-tuned-tr
- gpt2-small-turkish-writer
Intended uses & limitations
How to use
See the installation and generation code examples above.
Limitations and bias
The training data for this model comes from Turkish Wikipedia and books. It contains a lot of unfiltered content from the internet, which is far from neutral. Also, not much pre-processing was done on the books, so chapter names and page numbers may appear in some cases. This is a work in progress.
📦 Datasets
- wikipedia-turkish
- custom-book-corpus
📊 Metrics
| Property | Details |
|----------|---------|
| Model Type | Enhanced version of gpt2-small-turkish finetuned |
| Training Data | 28-10-2020 Wikipedia Turkish article dump, more than 400 classic Turkish novels and plays |
| Accuracy | 36.3% |
| Perplexity | 44.75 |
🔧 Technical Details
The model is based on the gpt2-small architecture. Since Turkish is not as close to English as Portuguese, the last 3 layers are trained instead of the last 2. The code is converted to work with Fastai 2.X, and Google Colab is used for training.
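As a rough illustration of the layer-freezing idea: the original training used fastai 2.X layer groups and gradual unfreezing following Pierre Guillou's tutorial, so the plain-PyTorch sketch below is only an approximation, not the actual training code.

```python
from transformers import AutoModelWithLMHead

# Start from the English gpt2-small checkpoint
model = AutoModelWithLMHead.from_pretrained("gpt2")

# Freeze every parameter first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the last 3 transformer blocks and the final layer norm
for block in model.transformer.h[-3:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.transformer.ln_f.parameters():
    param.requires_grad = True
```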
📄 License
This project is licensed under the Apache-2.0 license.
Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|-----------|------------|----------|------------|------|
| 0 | 4.497828 | 4.549605 | 0.277328 | 94.595070 | 2:09:58 |
| 1 | 4.503929 | 4.519456 | 0.275071 | 91.785645 | 2:04:30 |
| 2 | 3.612716 | 3.921146 | 0.344802 | 50.458256 | 2:03:22 |
| 3 | 3.777645 | 4.072006 | 0.326130 | 58.674530 | 1:56:14 |
| 4 | 2.934462 | 3.801303 | 0.363719 | 44.759476 | 1:58:55 |
Note: training used the 1cycle policy, and the epochs were run at different times, hence the varying durations.
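The reported perplexity is the exponential of the validation loss; a quick check against the last row of the table above (assuming natural-log loss, which is what the match suggests):

```python
import math

valid_loss = 3.801303        # epoch 4 from the table above
print(math.exp(valid_loss))  # ~44.7595, matching the reported perplexity of 44.75
```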