
Wangchanberta Base Wiki Newmm

Developed by airesearch
A RoBERTa BASE model pretrained on Thai Wikipedia, suitable for Thai text processing tasks
Downloads 115
Release date: 3/2/2022

Model Overview

This model uses the RoBERTa BASE architecture and was pretrained on the Thai Wikipedia corpus. It is primarily intended for masked language modeling in Thai, and can also be fine-tuned for text classification and token classification tasks.
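A minimal usage sketch for the masked language modeling task, assuming the model is published on the Hugging Face Hub under the id "airesearch/wangchanberta-base-wiki-newmm" (inferred from the card title) and that the standard transformers fill-mask pipeline applies:

```python
def top_predictions(fill_results, k=3):
    """Return the k highest-scoring (token, score) pairs from a
    fill-mask pipeline output (a list of dicts with "token_str"
    and "score" keys)."""
    ranked = sorted(fill_results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["score"]) for r in ranked[:k]]

def demo():
    # Heavy dependency imported here so the helper above stays standalone.
    from transformers import pipeline

    # Hub model id is an assumption based on this card's title.
    fill_mask = pipeline("fill-mask", model="airesearch/wangchanberta-base-wiki-newmm")

    # RoBERTa-style models use "<mask>" as the mask token.
    # Thai example: "I really like to eat <mask>."
    results = fill_mask("ผมชอบกิน<mask>มาก")
    for token, score in top_predictions(results):
        print(token, round(score, 3))
```

Calling demo() downloads the checkpoint and prints the top-3 fill-in candidates with their probabilities.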

Model Features

Thai Language Optimization: specifically pretrained and optimized for Thai text.
Multi-task Support: supports various downstream tasks, including text classification and named entity recognition.
Large-scale Pretraining: pretrained on a large-scale Thai Wikipedia corpus.

Model Capabilities

Masked Language Modeling
Text Classification
Named Entity Recognition
Part-of-Speech Tagging
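For the classification capabilities, the pretrained encoder needs a task-specific head. A sketch of attaching a 4-way sentiment head, assuming the Hub id "airesearch/wangchanberta-base-wiki-newmm" and a label order that is purely illustrative:

```python
# Label names and order are an assumption for illustration; the four
# categories themselves come from the sentiment use case on this card.
SENTIMENT_LABELS = ["positive", "neutral", "negative", "question"]

def id_to_label(pred_id: int) -> str:
    """Map a predicted class id to its human-readable label."""
    return SENTIMENT_LABELS[pred_id]

def load_for_classification():
    # Heavy dependency kept inside the function so the helpers above
    # stay standalone.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_id = "airesearch/wangchanberta-base-wiki-newmm"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # This attaches a freshly initialised classification head; it must be
    # fine-tuned on labeled data before its predictions are meaningful.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=len(SENTIMENT_LABELS)
    )
    return tokenizer, model
```

The same pattern with AutoModelForTokenClassification covers the named entity recognition and part-of-speech tagging capabilities.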

Use Cases

Sentiment Analysis
Social media sentiment analysis: analyze sentiment tendencies in social media posts and tweets; supports 4 sentiment categories (Positive, Neutral, Negative, Question).
Review Analysis
User review rating prediction: predict star ratings (1-5 stars) for user reviews.
News Classification
News topic classification: multi-label topic classification for news articles; supports 12 topic labels.
Information Extraction
Named entity recognition: identify named entities in text; supports 13 named entity types.
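Token classifiers for NER typically emit per-token BIO tags that must be merged into entity spans; since Thai is written without spaces between words, merged tokens are joined directly. A minimal sketch with hypothetical tag names (the card does not list the 13 actual entity types):

```python
def group_bio(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, entity_type) spans.
    Tokens are joined without spaces, matching Thai orthography.
    Real transformers pipelines offer aggregation_strategy="simple"
    for the same purpose."""
    spans = []
    cur_tokens, cur_type = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_tokens:
                spans.append(("".join(cur_tokens), cur_type))
            cur_tokens, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_tokens.append(tok)
        else:
            # "O" tag or an inconsistent "I-" tag closes the current span.
            if cur_tokens:
                spans.append(("".join(cur_tokens), cur_type))
            cur_tokens, cur_type = [], None
    if cur_tokens:
        spans.append(("".join(cur_tokens), cur_type))
    return spans
```

For example, tokens ["นาย", "สมชาย", "ไป", "กรุงเทพ"] with tags ["B-PER", "I-PER", "O", "B-LOC"] yield the spans ("นายสมชาย", "PER") and ("กรุงเทพ", "LOC").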