Melayu BERT
Melayu BERT is a masked language model based on BERT. It addresses the need for a high-performance language model for the Malay language. By leveraging the BERT architecture and fine-tuning on Malaysian datasets, it offers accurate language understanding capabilities for Malay text processing.
Quick Start
Melayu BERT can be easily used with the Hugging Face Transformers library and works for masked language modeling tasks out of the box; see the usage examples below.
Features
- Based on BERT: Built upon the well-known BERT architecture, which provides strong language understanding capabilities.
- Trained on OSCAR: The model was trained on the OSCAR dataset, specifically the unshuffled_original_ms subset, ensuring a rich and diverse training corpus.
- Fine-tuned on Malaysian Data: Starting from an English BERT model, it was fine-tuned on Malaysian datasets to better adapt to the Malay language.
- Low Perplexity: Achieves a perplexity of 9.46 on a 20% validation split, indicating good generalization ability (a short sketch of how perplexity relates to training loss follows this list).
- Multi-framework Support: Available for both PyTorch and TensorFlow use.
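The relationship between the reported perplexity and cross-entropy loss can be made concrete: perplexity is the exponential of the mean per-token loss. The snippet below is a minimal sketch using only numbers already stated on this card (the final logged training loss and the reported validation perplexity); nothing is computed from the model itself.

```python
import math

# Perplexity is exp(mean per-token cross-entropy loss).
final_training_loss = 2.3516          # last entry of the training-loss table below
print(math.exp(final_training_loss))  # ~10.5, the perplexity implied by that loss

reported_validation_perplexity = 9.46
print(math.log(reported_validation_perplexity))  # ~2.25, the loss it corresponds to
```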
Installation
To use the model, you need the transformers library installed. You can install it with the following command:
pip install transformers
Usage Examples
Basic Usage
As a Masked Language Model
from transformers import pipeline
pretrained_name = "StevenLimcorn/MelayuBERT"
fill_mask = pipeline(
"fill-mask",
model=pretrained_name,
tokenizer=pretrained_name
)
fill_mask("Saya [MASK] makan nasi hari ini.")
Import Tokenizer and Model
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("StevenLimcorn/MelayuBERT")
model = AutoModelForMaskedLM.from_pretrained("StevenLimcorn/MelayuBERT")
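As a sketch of using the tokenizer and model directly (this assumes the PyTorch weights and that torch is installed; the example sentence and the top-5 choice are illustrative):

```python
import torch

text = "Saya [MASK] makan nasi hari ini."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list the top-5 candidate tokens for it.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```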
Technical Details
The model was trained for 3 epochs with a learning rate of 2e-3. The training loss per step is as follows:
| Step | Training loss |
|------|---------------|
| 500  | 5.051300 |
| 1000 | 3.701700 |
| 1500 | 3.288600 |
| 2000 | 3.024000 |
| 2500 | 2.833500 |
| 3000 | 2.741600 |
| 3500 | 2.637900 |
| 4000 | 2.547900 |
| 4500 | 2.451500 |
| 5000 | 2.409600 |
| 5500 | 2.388300 |
| 6000 | 2.351600 |
Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger and a fine-tuning tutorial notebook written by Pierre Guillou.
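For readers who want to reproduce a similar run, the following is a minimal sketch of masked-language-model fine-tuning with the Transformers Trainer. Only the 3 epochs, the 2e-3 learning rate, the OSCAR unshuffled_original_ms subset and the 20% validation split come from this card; the starting checkpoint, sequence length, batch size and masking probability are assumptions, and this is not the authors' actual training script.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Load the Malay subset of OSCAR used for training.
dataset = load_dataset("oscar", "unshuffled_original_ms", split="train")

# Assumption: the English BERT base checkpoint as the starting point.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

def tokenize(batch):
    # max_length is an assumption, not taken from this card.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
split = tokenized.train_test_split(test_size=0.2)  # 20% held out for validation, as stated above

# Standard MLM collator that randomly masks tokens on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="melayu-bert",
    num_train_epochs=3,              # from this card
    learning_rate=2e-3,              # from this card
    per_device_train_batch_size=32,  # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=collator,
)
trainer.train()
```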
License
This project is licensed under the MIT license.
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Masked language model based on BERT |
| Training Data | OSCAR dataset (unshuffled_original_ms subset) |
Widget
You can test the model with the following input:
{
"text": "Saya [MASK] makan nasi hari ini."
}
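Besides the hosted widget, the same input can be sent to the Hugging Face Inference API. A minimal sketch, assuming the model is served by the hosted API and that you have an access token (YOUR_HF_TOKEN is a placeholder):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/StevenLimcorn/MelayuBERT"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder: use your own token

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Saya [MASK] makan nasi hari ini."},
)
print(response.json())
```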
Author
Melayu BERT was trained by Steven Limcorn and Wilson Wongso.