🚀 GPT-2 Base Thai
GPT-2 Base Thai is a causal language model for Thai, built on the OpenAI GPT-2 architecture to provide high-quality text generation and feature extraction.
🚀 Quick Start
GPT-2 Base Thai is a causal language model based on the OpenAI GPT-2 model. It was trained from scratch on the `unshuffled_deduplicated_th` subset of the OSCAR dataset, reaching an evaluation loss of 1.708 and an evaluation perplexity of 5.516.
This model was trained using Hugging Face's Flax framework as part of the JAX/Flax Community Week organized by Hugging Face. All training was done on a TPUv3-8 VM, sponsored by the Google Cloud team.
All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.
✨ Features
- Based on the OpenAI GPT-2 architecture, suitable for Thai language tasks.
- Trained from scratch on the `unshuffled_deduplicated_th` subset of the OSCAR dataset.
- Reached an evaluation loss of 1.708 and an evaluation perplexity of 5.516.
- Trained with Hugging Face's Flax framework during the JAX/Flax Community Week.
📚 Documentation
Model
| Property | Details |
|----------|---------|
| Model Type | `gpt2-base-thai` |
| #params | 124M |
| Architecture | GPT-2 |
| Training Data | `unshuffled_deduplicated_th` subset of the OSCAR dataset |
Evaluation Results
The model was trained for 3 epochs; the final results at the end of training were:
| Property | Details |
|----------|---------|
| Train Loss | 1.638 |
| Valid Loss | 1.708 |
| Valid PPL | 5.516 |
| Total Time | 6:12:34 |
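As a sanity check, the reported perplexity follows directly from the reported loss: for a causal language model, perplexity is the exponential of the mean per-token cross-entropy loss. A minimal sketch (the tiny gap from the reported 5.516 comes from rounding of the logged loss):

```python
import math

# Perplexity is exp(mean per-token cross-entropy loss)
valid_loss = 1.708
valid_ppl = math.exp(valid_loss)
print(round(valid_ppl, 3))  # close to the reported 5.516
```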
💻 Usage Examples
Basic Usage
As Causal Language Model
```python
from transformers import pipeline

pretrained_name = "flax-community/gpt2-base-thai"

nlp = pipeline(
    "text-generation",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Generate a continuation of a Thai prompt ("Good morning")
nlp("สวัสดีตอนเช้า")
```
Feature Extraction in PyTorch
```python
from transformers import GPT2Model, GPT2TokenizerFast

pretrained_name = "flax-community/gpt2-base-thai"
model = GPT2Model.from_pretrained(pretrained_name)
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_name)

# Encode a Thai prompt ("Good morning") and run it through the model
# to obtain per-token hidden states
prompt = "สวัสดีตอนเช้า"
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
```
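The `output.last_hidden_state` tensor has shape `(batch, sequence_length, hidden_size)`, where `hidden_size` is 768 for a base-size GPT-2 model. One common way to turn these per-token states into a single sentence vector is mean pooling over the token axis, sketched here with a dummy NumPy array so it runs without downloading the model:

```python
import numpy as np

# Dummy stand-in for output.last_hidden_state.detach().numpy():
# batch of 1, 5 tokens, 768 hidden dimensions
hidden_states = np.random.rand(1, 5, 768)

# Mean-pool over the token axis to get one vector per sentence
sentence_vector = hidden_states.mean(axis=1)
print(sentence_vector.shape)  # (1, 768)
```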
📄 License
This project is licensed under the MIT license.