
MyanBERTa

Developed by UCSYNLP
MyanBERTa is a pre-trained Burmese language model based on the BERT architecture, trained on a Burmese dataset of 5,992,299 sentences.
Downloads: 91
Release date: July 25, 2022

Model Overview

MyanBERTa is a pre-trained language model designed specifically for Burmese. It uses the BERT architecture with a byte-level BPE tokenizer and is suitable for a wide range of Burmese natural language processing tasks.
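As a minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub under the id UCSYNLP/MyanBERTa (inferred from the developer name above, not confirmed by this page):

```python
# Minimal sketch: load MyanBERTa with Hugging Face transformers.
# The repository id "UCSYNLP/MyanBERTa" is an assumption inferred from the developer name.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UCSYNLP/MyanBERTa")
model = AutoModel.from_pretrained("UCSYNLP/MyanBERTa")
```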

Model Features

Burmese-specific
Designed and optimized specifically for Burmese, so it better handles the linguistic characteristics of the language.
Large-scale Pre-training
Pre-trained on a large-scale Burmese dataset containing 5,992,299 sentences (136 million words).
Efficient Tokenization
Uses a byte-level BPE tokenizer with a learned vocabulary of 30,522 subword units; a tokenization sketch follows this list.
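A short sketch of inspecting the byte-level BPE tokenizer, under the same assumed Hub id; the Burmese example sentence ("hello") and its exact subword split are illustrative only:

```python
# Sketch: how the byte-level BPE tokenizer splits a Burmese sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UCSYNLP/MyanBERTa")  # assumed Hub id

text = "မင်္ဂလာပါ"  # "hello" in Burmese
tokens = tokenizer.tokenize(text)
print(tokens)  # subword pieces drawn from the 30,522-unit BPE vocabulary
print(tokenizer.convert_tokens_to_ids(tokens))  # their integer ids
```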

Model Capabilities

Burmese Text Understanding
Burmese Text Generation
Burmese Language Feature Extraction
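For feature extraction, a common recipe (standard for BERT-style models, not specific to this one) is to mean-pool the final hidden states into a sentence vector. The sketch below assumes the same unconfirmed UCSYNLP/MyanBERTa Hub id and standard transformers/PyTorch APIs:

```python
# Sketch: mean-pooled sentence embeddings from the final hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UCSYNLP/MyanBERTa")  # assumed Hub id
model = AutoModel.from_pretrained("UCSYNLP/MyanBERTa")
model.eval()

inputs = tokenizer("မြန်မာစာ", return_tensors="pt")  # "Burmese (written language)"
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state       # (batch, seq_len, hidden_size)
sentence_embedding = token_embeddings.mean(dim=1)  # one vector per sentence
print(sentence_embedding.shape)
```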

Use Cases

Natural Language Processing
Burmese Text Classification
Perform sentiment analysis or topic classification on Burmese text (a fine-tuning sketch follows this list)
Burmese Question Answering System
Build intelligent question-answering applications in Burmese
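As a sketch of how the classification use case would typically be set up with transformers (the Hub id, number of labels, and sample text are all placeholders or assumptions):

```python
# Sketch: preparing MyanBERTa for Burmese sentiment/topic classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UCSYNLP/MyanBERTa")  # assumed Hub id
model = AutoModelForSequenceClassification.from_pretrained(
    "UCSYNLP/MyanBERTa",
    num_labels=2,  # placeholder: e.g. positive / negative sentiment
)

# Tokenize a (placeholder) batch of Burmese texts for fine-tuning.
batch = tokenizer(["မင်္ဂလာပါ"], padding=True, truncation=True, return_tensors="pt")
# From here, fine-tune with the Trainer API or a plain PyTorch training loop
# on labeled Burmese classification data.
```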