🚀 Turkish GPT2 Model Finetuned
This is a GPT2-Small English-based model fine-tuned on Turkish Wikipedia articles, offering text generation capabilities in Turkish.
✨ Features
- Fine-tuned Model: Based on GPT2-Small and fine-tuned on Turkish Wikipedia articles as of 28-10-2020.
- Live Demo: A live demo is available at https://www.metayazar.com/.
- Fine-tuned Writer: A fine-tuned writer based on this model can be found at https://huggingface.co/gorkemgoknar/gpt2-turkish-writer.
- Code Compatibility: The code has been converted to work with Fastai 2.X.
- Training Environment: Trained using Google Colab.
📦 Installation
```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
# Note: AutoModelWithLMHead is deprecated in recent transformers releases;
# AutoModelForCausalLM is the drop-in replacement.
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

tokenizer.model_max_length = 1024
model.eval()  # disable dropout for inference
```
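Before generating, a quick round-trip check confirms the tokenizer handles Turkish text (a minimal sketch; the sample sentence is arbitrary):

```python
# Encode a Turkish sentence and decode it back.
ids = tokenizer("Merhaba dünya", return_tensors="pt").input_ids  # "Hello world"
print(ids.shape)                 # e.g. torch.Size([1, n_tokens])
print(tokenizer.decode(ids[0]))  # should reproduce the input text
```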
💻 Usage Examples
Basic Usage
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])
print('input text:', text)
print('predicted text:', predicted_text)
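The snippet above only takes the single most likely next token. Building on it, `torch.topk` exposes the runner-up candidates as well (a small sketch reusing the `logits` computed above):

```python
# Show the top-5 candidate next tokens and their raw scores.
top = torch.topk(logits[0, -1, :], k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()])!r}: {score.item():.2f}")
```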
Advanced Usage
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")
sample_outputs = model.generate(inputs.input_ids,
pad_token_id=50256,
do_sample=True,
max_length=50,
top_k=40,
num_return_sequences=1)
for i, sample_output in enumerate(sample_outputs):
print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))
📚 Documentation
Model description
This is a GPT2-Small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.
The work is based on Pierre Guillou's tutorial. The code was converted to work with Fastai 2.X, and Google Colab was used for training.
After five epochs of training, accuracy is 33.8% and perplexity is 51.89 (see Eval results below).
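Perplexity here is simply the exponential of the validation cross-entropy loss, so the reported figure can be reproduced from the final-epoch valid_loss in the Eval results table:

```python
import math

# perplexity = exp(valid_loss); epoch-4 valid_loss is 3.949103
print(math.exp(3.949103))  # ≈ 51.89, matching the reported perplexity
```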
Available models:
- gorkemgoknar/gpt2-small-turkish (this model)
- gorkemgoknar/gpt2-turkish-writer (a fine-tuned writer based on this model)
Limitations and bias
The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.
Training data
Turkish Wikipedia article dump as of 28-10-2020.
Eval results
| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
|------:|-----------:|-----------:|---------:|-----------:|:--------|
|     0 |   4.777015 |   4.621834 | 0.292547 | 101.680367 | 2:42:05 |
|     1 |   4.509412 |   4.403999 | 0.305574 |  81.777267 | 1:09:38 |
|     2 |   4.169529 |   4.120755 | 0.324908 |  61.605747 | 1:07:45 |
|     3 |   4.293973 |   4.177899 | 0.317211 |  65.228653 | 1:07:02 |
|     4 |   4.049848 |   3.949103 | 0.338347 |  51.888783 | 1:05:53 |

Epoch 0 was trained on a Tesla T4; epochs 1-4 on a V100.
📄 License
This model is licensed under the Apache-2.0 license.