🚀 Turkish GPT2 Model Finetuned
This is a GPT2-Small English-based model fine-tuned on Turkish Wikipedia articles, offering text generation capabilities in Turkish.
✨ Features
- Fine-tuned Model: Based on GPT2-Small and fine-tuned on Turkish Wikipedia articles as of 28-10-2020.
- Live Demo: A live demo is available at https://www.metayazar.com/.
- Fine-tuned Writer: A fine-tuned writer based on this model can be found at https://huggingface.co/gorkemgoknar/gpt2-turkish-writer.
- Code Compatibility: The code has been converted to work with Fastai 2.X.
- Training Environment: Trained using Google Colab.
📦 Installation
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
# Note: AutoModelWithLMHead is deprecated in recent transformers releases;
# AutoModelForCausalLM is the drop-in replacement.
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

tokenizer.model_max_length = 1024
model.eval()  # disable dropout for inference
```
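Before generating, a quick round-trip check confirms the tokenizer handles Turkish text (a minimal sketch; the sample sentence is arbitrary):

```python
# Encode a Turkish sentence and decode it back.
ids = tokenizer("Merhaba dünya", return_tensors="pt").input_ids  # "Hello world"
print(ids.shape)                 # e.g. torch.Size([1, n_tokens])
print(tokenizer.decode(ids[0]))  # should reproduce the input text
```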
💻 Usage Examples
Basic Usage
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
print('input text:', text)
print('predicted text:', predicted_text)
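The snippet above only takes the single most likely next token. Building on it, `torch.topk` exposes the runner-up candidates as well (a small sketch reusing the `logits` computed above):

```python
# Show the top-5 candidate next tokens and their raw scores.
top = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {score.item():.2f}")
```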
Advanced Usage
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
sample_outputs = model.generate(inputs.input_ids,
pad_token_id=50256,
do_sample=True,
max_length=50,
top_k=40,
num_return_sequences=1)
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
📚 Documentation
Model description
This is a GPT2-Small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.
The work is based on Pierre Guillou's tutorial. The code was converted to work with Fastai 2.X, and Google Colab was used for training.
After five epochs of training, accuracy is 33.8% and perplexity is 51.89 (see Eval results below).
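Perplexity here is simply the exponential of the validation cross-entropy loss, so the reported figure can be reproduced from the final-epoch valid_loss in the Eval results table:

```python
import math

# perplexity = exp(valid_loss); epoch-4 valid_loss is 3.949103
print(math.exp(3.949103))  # ≈ 51.89, matching the reported perplexity
```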
Available models:
- gorkemgoknar/gpt2-small-turkish (this model)
- gorkemgoknar/gpt2-turkish-writer (a fine-tuned writer based on this model)
Limitations and bias
The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
Training data
Turkish Wikipedia article dump as of 28-10-2020.
Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
|------:|-----------:|-----------:|---------:|-----------:|:--------|
|     0 |   4.777015 |   4.621834 | 0.292547 | 101.680367 | 2:42:05 |
|     1 |   4.509412 |   4.403999 | 0.305574 |  81.777267 | 1:09:38 |
|     2 |   4.169529 |   4.120755 | 0.324908 |  61.605747 | 1:07:45 |
|     3 |   4.293973 |   4.177899 | 0.317211 |  65.228653 | 1:07:02 |
|     4 |   4.049848 |   3.949103 | 0.338347 |  51.888783 | 1:05:53 |

Epoch 0 was trained on a Tesla T4; epochs 1-4 on a V100.
📄 License
This model is licensed under the Apache-2.0 license.