🚀 Turkish AI Writer based on GPT2-Small
This is an AI writer in Turkish based on the GPT2-small model, offering enhanced text generation capabilities.
🚀 Quick Start
Installation
Install the `transformers` and `torch` packages (e.g. `pip install transformers torch`), then load the tokenizer and model:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

# Note: AutoModelWithLMHead is deprecated in recent transformers releases;
# AutoModelForCausalLM can be used as a drop-in replacement for this model.
tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-turkish-writer")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-turkish-writer")

# Cap the tokenizer at GPT-2's 1024-token context window
tokenizer.model_max_length = 1024

# Disable dropout for inference (leave in train mode to finetune)
model.eval()
```
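If a GPU is available, the model can optionally be moved to it (not required; the demo site runs CPU inference):

```python
# Optional: run on GPU if one is available.
# Any inputs built with the tokenizer must be moved to the same device,
# e.g. inputs = inputs.to(device)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```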
Generate 1 word
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
print('input text:', text)
print('predicted text:', predicted_text)
Generate Full Sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
sample_outputs = model.generate(inputs.input_ids,
pad_token_id=50256,
do_sample=True,
max_length=50,
top_k=40,
num_return_sequences=1)
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
✨ Features
- This model is an enhanced version of the finetuned gpt2-small-turkish.
- It is trained with the 28-10-2020 Wikipedia Turkish article dump and more than 400 classic Turkish novels and plays (including works by Dostoyevski, Shakespeare, Dumas).
- The work builds on Pierre Guillou's tutorial (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb).
- Since Turkish is not as close to English as Portuguese, the last 3 layers are trained instead of the last 2.
- The code is converted to work with Fastai 2.X and trained using Google Colab.
- The current accuracy is 36.3%, and the perplexity is 44.75.
- A demo (using CPU inference) is available at http://www.metayazar.com.
- Models are available at gpt2-small-tuned-tr and gpt2-small-turkish-writer.
📚 Documentation
Model description
This model is an enhanced version of the finetuned gpt2-small-turkish. In addition to the 28-10-2020 Wikipedia Turkish article dump, it is trained with more than 400 classic Turkish novels and plays (including works by Dostoyevski, Shakespeare, Dumas).
The work builds on Pierre Guillou's tutorial (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb).
Note that since Turkish is not as close to English as Portuguese, instead of training the last 2 layers, the last 3 layers are trained.
The code is converted to work with Fastai 2.X and trained using Google Colab.
The current accuracy is 36.3%, and the perplexity is 44.75.
A demo (using CPU inference) is available at http://www.metayazar.com.
Models are available at:
- gpt2-small-tuned-tr
- gpt2-small-turkish-writer
Intended uses & limitations
How to use
See the installation and generation code examples above.
Limitations and bias
The training data for this model comes from Turkish Wikipedia and books. It contains a lot of unfiltered content from the internet, which is far from neutral. Also, not much pre-processing was done on the books, so chapter names and page numbers may appear in some cases. This is a work in progress.
📦 Datasets
- wikipedia-turkish
- custom-book-corpus
📊 Metrics
| Property | Details |
|----------|---------|
| Model Type | Enhanced version of gpt2-small-turkish finetuned |
| Training Data | 28-10-2020 Wikipedia Turkish article dump, more than 400 classic Turkish novels and plays |
| Accuracy | 36.3% |
| Perplexity | 44.75 |
🔧 Technical Details
The model is based on the gpt2-small architecture. Since Turkish is not as close to English as Portuguese, the last 3 layers are trained instead of the last 2. The code is converted to work with Fastai 2.X, and Google Colab is used for training.
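As a rough illustration of the layer-freezing idea: the original training used fastai 2.X layer groups and gradual unfreezing following Pierre Guillou's tutorial, so the plain-PyTorch sketch below is only an approximation, not the actual training code.

```python
from transformers import AutoModelWithLMHead

# Start from the English gpt2-small checkpoint
model = AutoModelWithLMHead.from_pretrained("gpt2")

# Freeze every parameter first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the last 3 transformer blocks and the final layer norm
for block in model.transformer.h[-3:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.transformer.ln_f.parameters():
    param.requires_grad = True
```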
📄 License
This project is licensed under the Apache-2.0 license.
Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|-----------|------------|----------|------------|------|
| 0 | 4.497828 | 4.549605 | 0.277328 | 94.595070 | 2:09:58 |
| 1 | 4.503929 | 4.519456 | 0.275071 | 91.785645 | 2:04:30 |
| 2 | 3.612716 | 3.921146 | 0.344802 | 50.458256 | 2:03:22 |
| 3 | 3.777645 | 4.072006 | 0.326130 | 58.674530 | 1:56:14 |
| 4 | 2.934462 | 3.801303 | 0.363719 | 44.759476 | 1:58:55 |
Note: training used the 1cycle policy, and the epochs were run at different times, hence the varying durations.
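The reported perplexity is the exponential of the validation loss; a quick check against the last row of the table above (assuming natural-log loss, which is what the match suggests):

```python
import math

valid_loss = 3.801303        # epoch 4 from the table above
print(math.exp(valid_loss))  # ~44.7595, matching the reported perplexity of 44.75
```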