Melayu BERT
Melayu BERT is a masked language model based on BERT. It addresses the need for a high-performance language model for the Malay language. By leveraging the BERT architecture and fine-tuning on Malaysian datasets, it offers accurate language understanding capabilities for Malay text processing.
Quick Start
Melayu BERT can be easily used with the Hugging Face Transformers library and works for masked language modeling tasks out of the box; see the usage examples below.
Features
- Based on BERT: Built upon the well-known BERT architecture, which provides strong language understanding capabilities.
- Trained on OSCAR: The model was trained on the OSCAR dataset, specifically the unshuffled_original_ms subset, ensuring a rich and diverse training corpus.
- Fine-tuned on Malaysian Data: Starting from an English BERT model, it was fine-tuned on Malaysian datasets to better adapt to the Malay language.
- Low Perplexity: Achieves a perplexity of 9.46 on a 20% validation split, indicating good generalization ability (a short sketch of how perplexity relates to training loss follows this list).
- Multi-framework Support: Available for both PyTorch and TensorFlow use.
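The relationship between the reported perplexity and cross-entropy loss can be made concrete: perplexity is the exponential of the mean per-token loss. The snippet below is a minimal sketch using only numbers already stated on this card (the final logged training loss and the reported validation perplexity); nothing is computed from the model itself.

```python
import math

# Perplexity is exp(mean per-token cross-entropy loss).
final_training_loss = 2.3516          # last entry of the training-loss table below
print(math.exp(final_training_loss))  # ~10.5, the perplexity implied by that loss

reported_validation_perplexity = 9.46
print(math.log(reported_validation_perplexity))  # ~2.25, the loss it corresponds to
```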
Installation
To use the model, you need the transformers library installed. You can install it with the following command:
pip install transformers
Usage Examples
Basic Usage
As a Masked Language Model
from transformers import pipeline
pretrained_name = "StevenLimcorn/MelayuBERT"
fill_mask = pipeline(
"fill-mask",
model=pretrained_name,
tokenizer=pretrained_name
)
fill_mask("Saya [MASK] makan nasi hari ini.")
Import Tokenizer and Model
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("StevenLimcorn/MelayuBERT")
model = AutoModelForMaskedLM.from_pretrained("StevenLimcorn/MelayuBERT")
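As a sketch of using the tokenizer and model directly (this assumes the PyTorch weights and that torch is installed; the example sentence and the top-5 choice are illustrative):

```python
import torch

text = "Saya [MASK] makan nasi hari ini."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list the top-5 candidate tokens for it.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```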
Technical Details
The model was trained for 3 epochs with a learning rate of 2e-3. The training loss per step is as follows:
| Step | Training loss |
|------|---------------|
| 500  | 5.051300 |
| 1000 | 3.701700 |
| 1500 | 3.288600 |
| 2000 | 3.024000 |
| 2500 | 2.833500 |
| 3000 | 2.741600 |
| 3500 | 2.637900 |
| 4000 | 2.547900 |
| 4500 | 2.451500 |
| 5000 | 2.409600 |
| 5500 | 2.388300 |
| 6000 | 2.351600 |
Many of the techniques used are based on a Hugging Face tutorial notebook written by Sylvain Gugger and a fine-tuning tutorial notebook written by Pierre Guillou.
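For readers who want to reproduce a similar run, the following is a minimal sketch of masked-language-model fine-tuning with the Transformers Trainer. Only the 3 epochs, the 2e-3 learning rate, the OSCAR unshuffled_original_ms subset and the 20% validation split come from this card; the starting checkpoint, sequence length, batch size and masking probability are assumptions, and this is not the authors' actual training script.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Load the Malay subset of OSCAR used for training.
dataset = load_dataset("oscar", "unshuffled_original_ms", split="train")

# Assumption: the English BERT base checkpoint as the starting point.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

def tokenize(batch):
    # max_length is an assumption, not taken from this card.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
split = tokenized.train_test_split(test_size=0.2)  # 20% held out for validation, as stated above

# Standard MLM collator that randomly masks tokens on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="melayu-bert",
    num_train_epochs=3,              # from this card
    learning_rate=2e-3,              # from this card
    per_device_train_batch_size=32,  # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=collator,
)
trainer.train()
```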
License
This project is licensed under the MIT license.
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Masked language model based on BERT |
| Training Data | OSCAR dataset (unshuffled_original_ms subset) |
Widget
You can test the model with the following input:
{
"text": "Saya [MASK] makan nasi hari ini."
}
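Besides the hosted widget, the same input can be sent to the Hugging Face Inference API. A minimal sketch, assuming the model is served by the hosted API and that you have an access token (YOUR_HF_TOKEN is a placeholder):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/StevenLimcorn/MelayuBERT"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # placeholder: use your own token

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Saya [MASK] makan nasi hari ini."},
)
print(response.json())
```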
Author
Melayu BERT was trained by Steven Limcorn and Wilson Wongso.