🚀 GPT2-Persian
bolbolzaban/gpt2-persian is a GPT2 language model trained with hyperparameters similar to the standard GPT2-medium, with the following differences:
- The context size is reduced from 1024 to 256 sub-words to make training more affordable.
- Instead of BPE, Google's SentencePiece tokenizer is used for tokenization.
- The training dataset only includes Persian text. All non-Persian characters are replaced with special tokens (e.g., [LAT], [URL], [NUM]).
For further details, please refer to this blog post. You can also try the model here or on Bolbolzaban.com.
🚀 Quick Start
✨ Features
- Reduced Context Size: The context size is reduced from 1024 to 256 sub-words, making training more cost-effective.
- SentencePiece Tokenizer: Utilizes Google's SentencePiece tokenizer instead of BPE for tokenization.
- Persian-Only Training Data: The model is trained solely on Persian text, with non-Persian characters replaced by special tokens.
📦 Installation
No dedicated installation step is required for the model itself; it is used through the Hugging Face transformers library (for example, pip install transformers sentencepiece), as shown in the usage examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for text generation:
```python
from transformers import pipeline, AutoTokenizer, GPT2LMHeadModel

# Load the SentencePiece-based tokenizer and the model weights
tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
model = GPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')

# Build a generation pipeline; the model's context window is 256 sub-words
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
print(sample[0]['generated_text'])
```
If you are using TensorFlow, import TFGPT2LMHeadModel instead of GPT2LMHeadModel.
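For reference, a minimal TensorFlow sketch might look like the following (an illustrative example rather than an official snippet; TensorFlow must be installed, and from_pt=True may be needed if the repository only provides PyTorch weights):

```python
from transformers import pipeline, AutoTokenizer, TFGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
# Add from_pt=True here if only PyTorch weights are published for the model
model = TFGPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sample = generator('در یک اتفاق شگفت انگیز، پژوهشگران', max_length=256)
print(sample[0]['generated_text'])
```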
🔧 Technical Details
The model bolbolzaban/gpt2-persian is a GPT2-based language model trained with hyperparameters similar to the standard GPT2-medium, with a few key differences. The context size is reduced from the standard 1024 to 256 sub-words to make training more affordable. Google's SentencePiece tokenizer is used instead of the standard Byte Pair Encoding (BPE) tokenizer. The training dataset consists only of Persian text, and non-Persian characters are replaced with special tokens.
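As a quick illustration of how the SentencePiece tokenizer splits Persian input (a minimal sketch, assuming only the transformers and sentencepiece packages are installed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
text = 'در یک اتفاق شگفت انگیز، پژوهشگران'
ids = tokenizer.encode(text)
# The model's context window is 256 sub-words, so prompts should stay below that
print(len(ids))
print(tokenizer.convert_ids_to_tokens(ids))
```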
📚 Documentation
Fine-tuning
A basic fine-tuning example is available on this GitHub repo.
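That repo contains the actual example; the sketch below is only a rough, generic outline of causal-LM fine-tuning with the Hugging Face Trainer, not the code from that repo (the file name train.txt and all hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
model = GPT2LMHeadModel.from_pretrained('bolbolzaban/gpt2-persian')
if tokenizer.pad_token is None:
    # Assumption: reuse the end-of-statement token for padding during batching
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: one normalized Persian document per line
dataset = load_dataset('text', data_files={'train': 'train.txt'})

def tokenize(batch):
    # Truncate to the model's 256 sub-word context window
    return tokenizer(batch['text'], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir='gpt2-persian-finetuned',
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized['train'], data_collator=collator)
trainer.train()
```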
Special Tokens
The model gpt2-persian is trained for research on Persian poetry. All English words and numbers are replaced with special tokens, and only the standard Persian alphabet is used as part of the input text.
For example:
Original text: اگر آیفون یا آیپد شما دارای سیستم عامل iOS 14.3 یا iPadOS 14.3 یا نسخه‌های جدیدتر باشد
Text used in training: اگر آیفون یا آیپد شما دارای سیستم عامل [LAT] [NUM] یا [LAT] [NUM] یا نسخه‌های جدیدتر باشد
Please consider normalizing your input text using Hazm or similar libraries and ensure only Persian characters are provided as input.
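The exact preprocessing used during training is not spelled out here, so the following is only an illustrative sketch: it uses Hazm's Normalizer and simple regular expressions (my own assumptions, not the training pipeline's rules) to mask Latin words and numbers with the [LAT] and [NUM] tokens:

```python
import re
from hazm import Normalizer  # pip install hazm

normalizer = Normalizer()

def prepare(text):
    """Normalize Persian text and mask non-Persian content with special tokens."""
    text = normalizer.normalize(text)
    text = re.sub(r'[A-Za-z]+', '[LAT]', text)  # Latin-script words
    text = re.sub(r'[0-9\u06F0-\u06F9]+(?:\.[0-9\u06F0-\u06F9]+)?', '[NUM]', text)  # numbers
    return text

print(prepare('اگر آیفون یا آیپد شما دارای سیستم عامل iOS 14.3 باشد'))
```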
If you want to use classical Persian poetry as input, use [BOM] (beginning of mesra) at the beginning of each verse (مصرع) followed by [EOS] (end of statement) at the end of each couplet (بیت).
The following example prompts show how a couplet is built up in this format:
[BOM] توانا بود
[BOM] توانا بود هر که دانا بود [BOM]
[BOM] توانا بود هر که دانا بود [BOM] ز دانش دل پیر
[BOM] توانا بود هر که دانا بود [BOM] ز دانش دل پیر برنا بود [EOS]
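For instance, reusing the generator from the Quick Start section, a single opening hemistich can be passed as a prompt (an illustrative sketch; max_length is an arbitrary choice):

```python
# Assumes `generator` was created as in the Quick Start section above
prompt = '[BOM] توانا بود هر که دانا بود [BOM]'
sample = generator(prompt, max_length=64)
print(sample[0]['generated_text'])
```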
If you would like to know more about the structure of classical Persian poetry, refer to these blog posts.
📄 License
This project is licensed under the Apache-2.0 license.
Acknowledgment
This project is supported by Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
Citation and Reference
Please reference the "bolbolzaban.com" website if you are using gpt2-persian in your research or commercial application.
Contacts
Please reach out on LinkedIn or Telegram if you have any questions or need help using the model.
Follow Bolbolzaban on Twitter, Telegram, or Instagram.