🚀 GPT2-Persian
bolbolzaban/gpt2-persian is a GPT2 language model trained with hyperparameters similar to the standard GPT2-medium, with the following differences:
- The context size is reduced from 1024 to 256 sub-words to make training more affordable.
- Instead of BPE, Google's SentencePiece tokenizer is used for tokenization.
- The training dataset only includes Persian text. All non-Persian characters are replaced with special tokens (e.g., [LAT], [URL], [NUM]).
For further details, please refer to this blog post. You can also try the model here or on Bolbolzaban.com.
🚀 Quick Start
✨ Features
- Reduced Context Size: The context size is reduced from 1024 to 256 sub-words, making training more cost-effective.
- SentencePiece Tokenizer: Utilizes Google's SentencePiece tokenizer instead of BPE for tokenization.
- Persian-Only Training Data: The model is trained solely on Persian text, with non-Persian characters replaced by special tokens.
📦 Installation
No dedicated installation step is required for the model itself; it is used through the Hugging Face transformers library (for example, pip install transformers sentencepiece), as shown in the usage examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for text generation:
```python
from transformers import pipeline, AutoTokenizer, GPT2LMHeadModel

# Load the SentencePiece-based tokenizer and the model weights
tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
model = GPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')

# Build a generation pipeline; the model's context window is 256 sub-words
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
print(sample[0]['generated_text'])
```
If you are using TensorFlow, import TFGPT2LMHeadModel instead of GPT2LMHeadModel.
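For reference, a minimal TensorFlow sketch might look like the following (an illustrative example rather than an official snippet; TensorFlow must be installed, and from_pt=True may be needed if the repository only provides PyTorch weights):

```python
from transformers import pipeline, AutoTokenizer, TFGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
# Add from_pt=True here if only PyTorch weights are published for the model
model = TFGPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
print(sample[0]['generated_text'])
```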
🔧 Technical Details
The model bolbolzaban/gpt2-persian is a GPT2-based language model trained with hyperparameters similar to the standard GPT2-medium, with a few key differences. The context size is reduced from the standard 1024 to 256 sub-words to make training more affordable. Google's SentencePiece tokenizer is used instead of the standard Byte Pair Encoding (BPE) tokenizer. The training dataset consists only of Persian text, and non-Persian characters are replaced with special tokens.
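As a quick illustration of how the SentencePiece tokenizer splits Persian input (a minimal sketch, assuming only the transformers and sentencepiece packages are installed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
text = 'در یک اتفاق شگفت انگیز، پژوهشگران'
ids = tokenizer.encode(text)
# The model's context window is 256 sub-words, so prompts should stay below that
print(len(ids))
print(tokenizer.convert_ids_to_tokens(ids))
```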
📚 Documentation
Fine-tuning
A basic fine-tuning example is available on this GitHub repo.
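That repo contains the actual example; the sketch below is only a rough, generic outline of causal-LM fine-tuning with the Hugging Face Trainer, not the code from that repo (the file name train.txt and all hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
model = GPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')
if tokenizer.pad_token is None:
    # Assumption: reuse the end-of-statement token for padding during batching
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: one normalized Persian document per line
dataset = load_dataset('text', data_files={'train': 'train.txt'})

def tokenize(batch):
    # Truncate to the model's 256 sub-word context window
    return tokenizer(batch['text'], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir='gpt2-persian-finetuned',
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'], data_collator=collator)
trainer.train()
```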
Special Tokens
The model gpt2-persian is trained for research on Persian poetry. All English words and numbers are replaced with special tokens, and only the standard Persian alphabet is used as part of the input text.
For example:
Original text: اگر آیفون یا آیپد شما دارای سیستم عامل iOS 14.3 یا iPadOS 14.3 یا نسخه‌های جدیدتر باشد
Text used in training: اگر آیفون یا آیپد شما دارای سیستم عامل [LAT] [NUM] یا [LAT] [NUM] یا نسخه‌های جدیدتر باشد
Please consider normalizing your input text using Hazm or similar libraries and ensure only Persian characters are provided as input.
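The exact preprocessing used during training is not spelled out here, so the following is only an illustrative sketch: it uses Hazm's Normalizer and simple regular expressions (my own assumptions, not the training pipeline's rules) to mask Latin words and numbers with the [LAT] and [NUM] tokens:

```python
import re
from hazm import Normalizer  # pip install hazm

normalizer = Normalizer()

def prepare(text):
    """Normalize Persian text and mask non-Persian content with special tokens."""
    text = normalizer.normalize(text)
    text = re.sub(r'[A-Za-z]+', '[LAT]', text)  # Latin-script words
    text = re.sub(r'[0-9\u06F0-\u06F9]+(?:\.[0-9\u06F0-\u06F9]+)?', '[NUM]', text)  # numbers
    return text

print(prepare('اگر آیفون یا آیپد شما دارای سیستم عامل iOS 14.3 باشد'))
```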
If you want to use classical Persian poetry as input, use [BOM] (beginning of mesra) at the beginning of each verse (مصرع) followed by [EOS] (end of statement) at the end of each couplet (بیت).
The following example prompts show how a couplet is built up in this format:
[BOM] توانا بود
[BOM] توانا بود هر که دانا بود [BOM]
[BOM] توانا بود هر که دانا بود [BOM] ز دانش دل پیر
[BOM] توانا بود هر که دانا بود [BOM] ز دانش دل پیر برنا بود [EOS]
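For instance, reusing the generator from the Quick Start section, a single opening hemistich can be passed as a prompt (an illustrative sketch; max_length is an arbitrary choice):

```python
# Assumes `generator` was created as in the Quick Start section above
prompt = '[BOM] توانا بود هر که دانا بود [BOM]'
sample = generator(prompt, max_length=64)
print(sample[0]['generated_text'])
```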
If you would like to know more about the structure of classical Persian poetry, refer to these blog posts.
📄 License
This project is licensed under the Apache-2.0 license.
Acknowledgment
This project is supported by Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
Citation and Reference
Please reference the "bolbolzaban.com" website if you are using gpt2-persian in your research or commercial application.
Contacts
Please reach out on LinkedIn or Telegram if you have any questions or need help using the model.
Follow Bolbolzaban on Twitter, Telegram, or Instagram.