
Wangchanberta Base Att Spm Uncased

Developed by airesearch
A RoBERTa BASE model pre-trained on 78.5GB of Thai text, suitable for masked language modeling and text classification tasks.
Downloads 34.38k
Release date: 3/2/2022

Model Overview

This is a Thai pre-trained language model based on the RoBERTa architecture, used primarily for masked language modeling and text classification. It supports a range of downstream NLP applications.

Model Features

Large-scale Thai pre-training
Trained on 78.5GB of Thai text data, giving it strong Thai language understanding
Multi-task support
Supports various NLP tasks including masked language modeling, text classification, and sequence labeling
Optimized RoBERTa architecture
Builds on the RoBERTa pre-training recipe for improved contextual understanding

Model Capabilities

Thai text understanding
Masked language prediction
Sentiment analysis
Text classification
Sequence labeling
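
The masked-language-prediction capability above can be tried directly with the Hugging Face `fill-mask` pipeline. A minimal sketch follows; the hub id `airesearch/wangchanberta-base-att-spm-uncased` is assumed from the model name, so verify it on the model page before use.

```python
# Minimal sketch: masked language prediction with the transformers
# fill-mask pipeline. The hub id below is assumed from the model name.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="airesearch/wangchanberta-base-att-spm-uncased",
)

# Thai sentence with one masked token ("I like to eat <mask>").
predictions = fill_mask("ฉันชอบกิน<mask>")

for p in predictions:
    # Each prediction carries the filled token and a confidence score.
    print(p["token_str"], round(p["score"], 3))
```

The pipeline returns the top candidate tokens ranked by score, which is also a quick sanity check that the tokenizer and weights loaded correctly.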

Use Cases

Social media analysis
Sentiment analysis
Analyze sentiment tendencies in Thai social media posts
Can classify into four sentiment categories: positive/neutral/negative/question
Review analysis
Review rating prediction
Predict user review ratings (1-5 stars)
Performs well on the Wongnai review dataset
News classification
Multi-label topic classification
Classify Thai news into 12 topic categories
Achieves multi-label classification on the Prachathai67k dataset
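
Multi-label topic classification, as used on Prachathai67k, differs from single-label classification in that each topic receives an independent sigmoid score, and a threshold decides membership, so one article can belong to several topics at once. A self-contained sketch of that decision step (the topic names and logit values here are invented for illustration, and only a subset of the 12 topics is shown):

```python
import math

# Hypothetical raw logits from a fine-tuned classification head,
# one per topic. Values are invented for illustration only.
TOPICS = ["politics", "human_rights", "environment", "labor"]
logits = [2.1, -0.4, 0.9, -3.0]

def sigmoid(x: float) -> float:
    """Map a logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))

THRESHOLD = 0.5  # common default; in practice, tuned per label

probs = [sigmoid(z) for z in logits]
predicted = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]

print(predicted)  # → ['politics', 'environment']
```

Unlike softmax, the per-label sigmoids do not compete with each other, which is what allows zero, one, or many topics to be assigned to the same article.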