🚀 GPT-2 Turkish Model
A language model specialized for Turkish, belonging to the LLM (Large Language Model) category.
Widgets
| Example Title | Text |
| --- | --- |
| The capital city of France | 'The capital city of France' |
| The capital city of England | 'The capital city of England' |
| The capital city of Italy | 'The capital city of Italy' |
| The capital city of Mongolia | 'The capital city of Mongolia' |
| The country where the Amazon rainforests are located | 'The country where the Amazon rainforests are located' |
| The city that connects Europe to Asia | 'The city that connects Europe to Asia' |
| The continent where zebras live | 'The continent where zebras live' |
| The traditional rival of Fenerbahçe | 'The traditional rival of Fenerbahçe' |
| One-legged frog | 'One-legged frog' |
| Rain in Rize | 'Rain in Rize' |
| The meaning of life | 'The meaning of life' |
| Saint-Joseph | 'Saint-Joseph' |
| The names of colors are as follows | 'The names of colors are as follows' |
| Climate change | 'Climate change' |
| Among salty foods | 'Among salty foods' |
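The widget prompts above are English renderings of the Turkish examples in the original model card. As a minimal sketch of how such a prompt can be tried locally, assuming the `transformers` text-generation pipeline and reusing the lowercase Turkish prompt from the Quick Start section below:

```python
from transformers import pipeline

# Build a text-generation pipeline around the published checkpoint.
generator = pipeline("text-generation", model="cenkersisman/gpt2-turkish-128-token")

# Prompts must be lowercase (see the note in Usage Examples below).
result = generator("okyanusun derinliklerinde bulunan", max_length=50)
print(result[0]["generated_text"])
```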
Model Description
The GPT-2 Turkish Model is a language model specialized for Turkish and belongs to the LLM (Large Language Model) category. It is based on the GPT-2 architecture and uses a tokenizer prepared specifically for Turkish. The model can generate human-like text from a given starting prompt and was trained on a large Turkish text dataset.
A Wikipedia dataset of 900 million characters was used to train the model. Sentences in the training set are limited to 128 tokens (a token corresponding to a root word or affix), so the length of the text the model can generate is correspondingly limited. A tokenizer suited to the Turkish syllable structure was used, and the model was trained for approximately 154 epochs over 7.5 million steps.
An NVIDIA GeForce RTX 3050 GPU with 4 GB of dedicated memory was used for training, together with 16 GB of shared GPU memory, for a total of 20 GB used during the training process.
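As a small illustration of the Turkish-oriented tokenizer and the 128-token limit described above, the following sketch loads the published tokenizer (the checkpoint name is the one used in the Quick Start section below) and inspects how a Turkish prompt is split into subword pieces:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("cenkersisman/gpt2-turkish-128-token")

# Show how a Turkish phrase is split into subword pieces by the custom tokenizer.
pieces = tokenizer.tokenize("okyanusun derinliklerinde bulunan")
print(pieces)
print(len(pieces), "tokens")

# Training sequences were capped at 128 tokens, so longer prompts
# should be truncated before generation.
ids = tokenizer.encode("okyanusun derinliklerinde bulunan", max_length=128, truncation=True)
print(len(ids), "token ids after truncation")
```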
🚀 Quick Start
📦 Installation
The original model card does not include installation instructions; the usage example below assumes the Hugging Face `transformers` library and PyTorch are installed (for example via `pip install transformers torch`).
💻 Usage Examples
⚠️ Important Note
Since the model is case-sensitive, the prompt must be written entirely in lowercase.
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "cenkersisman/gpt2-turkish-128-token"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# The model is case-sensitive, so the prompt is written entirely in lowercase.
prompt = "okyanusun derinliklerinde bulunan"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Greedy generation up to 100 tokens; the EOS token is used for padding.
output = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
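Two common adjustments to the snippet above are lowercasing user input before generation and switching to sampling. Note that Python's default `str.lower()` follows English casing rules for the Turkish I/İ pair, so a Turkish-aware helper is used here. This is a hedged sketch; the sampling parameters are illustrative and not values recommended by the model card:

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "cenkersisman/gpt2-turkish-128-token"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

def turkish_lower(text: str) -> str:
    # Map the Turkish I/İ pair explicitly before generic lowercasing,
    # because str.lower() follows English casing rules.
    return text.replace("I", "ı").replace("İ", "i").lower()

prompt = turkish_lower("Okyanusun derinliklerinde bulunan")
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sampling-based generation (illustrative parameters, not from the model card).
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```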
🔧 Technical Details
Training Process Curve

Limitations and Biases
This model was trained as an autoregressive language model, meaning its primary function is to take a sequence of text and predict the next token. Although language models are widely used for many tasks beyond next-token prediction, there are many unknowns about how this model behaves in such settings.
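To make the autoregressive behaviour concrete, the following sketch (assuming PyTorch and the checkpoint name from the Quick Start section) inspects the model's probability distribution over the next token for a short Turkish prompt:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "cenkersisman/gpt2-turkish-128-token"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

input_ids = tokenizer.encode("okyanusun derinliklerinde", return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch, sequence, vocab)

# Probability distribution over the token that would follow the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)]):<20} {prob.item():.3f}")
```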
The model was trained on a dataset known to contain profanity, explicit content, and descriptions of inappropriate behavior. Depending on your use case, the model may generate socially unacceptable text.
As with all language models, it is difficult to predict in advance how this model will respond to a given input, and offensive content may appear without warning. It is recommended that humans review or filter the outputs before publishing them, both to censor unwanted content and to improve their quality.
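As suggested above, outputs can be passed through a human review step or an automated filter before publication. The following is a minimal sketch of a blocklist-style filter; the `BLOCKLIST` contents and names are hypothetical placeholders, not part of the model or its card, and a real deployment would use a curated list or a dedicated moderation model:

```python
# Hypothetical blocklist; real deployments would use a curated list or a
# dedicated moderation model rather than this illustrative placeholder.
BLOCKLIST = {"placeholder_profanity_1", "placeholder_profanity_2"}

def is_acceptable(text: str) -> bool:
    # Flag any generated text that contains a blocklisted word.
    words = set(text.lower().split())
    return BLOCKLIST.isdisjoint(words)

def filter_outputs(candidates):
    # Keep only candidates that pass the simple blocklist check.
    return [text for text in candidates if is_acceptable(text)]
```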
📄 License
No license information is provided in the original model card.