BERT-th
This project is adapted from https://github.com/ThAIKeras/bert for the HuggingFace/Transformers library. It provides a Thai-only pre-trained model based on the BERT-Base structure, aiming to address the challenges of pre-training on Thai text.
Quick Start
Pre-tokenization
You must run the original ThaiTokenizer to ensure your tokenization matches that of the original model. If you skip this step, you won't perform much better than mBERT or random chance!
You can refer to this Colab notebook or follow these steps:
pip install pythainlp six sentencepiece python-crfsuite
git clone https://github.com/ThAIKeras/bert
# download .vocab and .model files from ThAIKeras/bert > Tokenization section
Or download from .vocab and .model links.
Then set up the ThaiTokenizer class, which is slightly modified to remove a TensorFlow dependency.
import collections
import unicodedata
import six

def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")

def load_vocab(vocab_file):
    vocab = collections.OrderedDict()
    index = 0
    with open(vocab_file, "r") as reader:
        while True:
            token = reader.readline()
            if token.split():
                token = token.split()[0]  # to support SentencePiece vocab file
            token = convert_to_unicode(token)
            if not token:
                break
            token = token.strip()
            vocab[token] = index
            index += 1
    return vocab
#####
from bert.bpe_helper import BPE
import sentencepiece as spm

def convert_by_vocab(vocab, items):
    output = []
    for item in items:
        output.append(vocab[item])
    return output

class ThaiTokenizer(object):
    """Tokenizes Thai texts."""

    def __init__(self, vocab_file, spm_file):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.bpe = BPE(vocab_file)
        self.s = spm.SentencePieceProcessor()
        self.s.Load(spm_file)

    def tokenize(self, text):
        bpe_tokens = self.bpe.encode(text).split(' ')
        spm_tokens = self.s.EncodeAsPieces(text)
        tokens = bpe_tokens if len(bpe_tokens) < len(spm_tokens) else spm_tokens
        split_tokens = []
        for token in tokens:
            new_token = token
            if token.startswith('_') and not token in self.vocab:
                split_tokens.append('_')
                new_token = token[1:]
            if not new_token in self.vocab:
                split_tokens.append('<unk>')
            else:
                split_tokens.append(new_token)
        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

    def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)
Then, pre-tokenize your own text:
from pythainlp import sent_tokenize
tokenizer = ThaiTokenizer(vocab_file='th.wiki.bpe.op25000.vocab', spm_file='th.wiki.bpe.op25000.model')
txt = "āļāļĢāļļāļāđāļāļāļĄāļŦāļēāļāļāļĢāđāļāđāļāđāļāļāļāļāļāļĢāļāļāļāļīāđāļĻāļĐāļāļāļāļāļĢāļ°āđāļāļĻāđāļāļĒ āļĄāļīāđāļāđāļĄāļĩāļŠāļāļēāļāļ°āđāļāđāļāļāļąāļāļŦāļ§āļąāļ āļāļģāļ§āđāļē \"āļāļĢāļļāļāđāļāļāļĄāļŦāļēāļāļāļĢ\" āļāļąāđāļāļĒāļąāļāđāļāđāđāļĢāļĩāļĒāļāļāļāļāđāļāļĢāļāļāļāļĢāļāļāļŠāđāļ§āļāļāđāļāļāļāļīāđāļāļāļāļāļāļĢāļļāļāđāļāļāļĄāļŦāļēāļāļāļĢāļāļĩāļāļāđāļ§āļĒ"
split_sentences = sent_tokenize(txt)
print(split_sentences)
"""
['กรุงเทพมหานครเป็นเขตปกครองพิเศษของประเทศไทย ',
 'มิได้มีสถานะเป็นจังหวัด ',
 'คำว่า "กรุงเทพมหานคร" นั้นยังใช้เรียกองค์กรปกครองส่วนท้องถิ่นของกรุงเทพมหานครอีกด้วย']
"""
split_words = ' '.join(tokenizer.tokenize(' '.join(split_sentences)))
print(split_words)
"""
'▁กรุงเทพมหานคร เป็นเขต ปกครอง พิเศษ ของประเทศไทย ▁มิ ได้มี สถานะเป็น จังหวัด ▁คําว่า ▁" กรุงเทพมหานคร " ▁นั้น...' # continues
"""
Features
Google's BERT is currently the state-of-the-art method for pre-training text representations and provides multilingual models. However, Thai was excluded from the 103 supported languages because of word segmentation difficulties. BERT-th fills this gap with a Thai-only pre-trained model based on the BERT-Base structure. It also includes relevant code and scripts, all modified from the original BERT project.
Installation
Preprocessing
Data Source
The training data for BERT-th come from the latest article dump of Thai Wikipedia, dated November 2, 2018. The raw texts are extracted using WikiExtractor.
Sentence Segmentation
Input data need to be segmented into separate sentences before further processing by the BERT modules. Since Thai has no explicit sentence-ending markers, determining sentence boundaries is challenging. In this project, sentence segmentation is done with simple heuristics that consider spaces, sentence length, and common conjunctions.
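The exact heuristics live in the original repository's preprocessing code; purely as an illustration of the idea, a splitter in the same spirit might split on spaces and then merge fragments that are too short or begin with a common conjunction (the threshold and conjunction list below are placeholder values, not the project's actual rules):
# Illustrative sketch only -- not the project's actual segmentation rules.
# MIN_LEN and CONJUNCTIONS are placeholder values chosen for demonstration.
MIN_LEN = 30
CONJUNCTIONS = ('และ', 'แต่', 'หรือ', 'ซึ่ง', 'โดย')  # "and", "but", "or", "which", "by"

def naive_sent_segment(text):
    """Split Thai text on spaces, then merge fragments that look like
    continuations (too short, or starting with a common conjunction)."""
    fragments = text.split(' ')
    sentences = []
    for frag in fragments:
        if not frag:
            continue
        is_continuation = sentences and (
            len(frag) < MIN_LEN or frag.startswith(CONJUNCTIONS)
        )
        if is_continuation:
            sentences[-1] += ' ' + frag
        else:
            sentences.append(frag)
    return sentences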
After preprocessing, the training corpus consists of approximately 2 million sentences and 40 million words (counted after word segmentation by PyThaiNLP). The plain and segmented texts can be downloaded here.
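The word count above refers to tokens produced by PyThaiNLP's word segmenter; counting the same way might look like this (a small sketch using pythainlp's word_tokenize, installed in the Quick Start step):
from pythainlp import word_tokenize

# One sentence from the Quick Start example, used here only for illustration.
sentences = ['มิได้มีสถานะเป็นจังหวัด']
# Segment each sentence with PyThaiNLP and sum the token counts.
n_words = sum(len(word_tokenize(s)) for s in sentences)
print(n_words)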
Tokenization
BERT uses WordPiece as its tokenization mechanism, but the WordPiece implementation is Google-internal, so we cannot apply existing Thai word segmentation and then use WordPiece to learn a set of subword units. The best alternative is SentencePiece, which implements BPE and does not require prior word segmentation.
In this project, we adopt a pre-trained Thai SentencePiece model from BPEmb. The model with a 25,000-piece vocabulary is chosen, and the vocabulary file is augmented with BERT's special tokens: '[PAD]', '[CLS]', '[SEP]', and '[MASK]'. The model and vocabulary files can be downloaded here.
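The augmentation itself is a plain edit of the .vocab file; a hedged sketch of what it could look like (the original project's exact token placement and scores may differ, and the output file name is a placeholder):
# Illustrative sketch: prepend BERT's special tokens to the BPEmb vocab file.
# Each SentencePiece vocab line is "<piece>\t<score>"; dummy scores are used here.
SPECIAL_TOKENS = ['[PAD]', '[CLS]', '[SEP]', '[MASK]']

with open('th.wiki.bpe.op25000.vocab', encoding='utf-8') as f:
    lines = f.read().splitlines()

augmented = ['%s\t0' % tok for tok in SPECIAL_TOKENS] + lines

# Placeholder output path for this sketch.
with open('th.wiki.bpe.op25000.augmented.vocab', 'w', encoding='utf-8') as f:
    f.write('\n'.join(augmented) + '\n')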
Both SentencePiece and bpe_helper.py from BPEmb are used to tokenize data. The ThaiTokenizer class has been added to BERT's tokenization.py for tokenizing Thai texts.
Pre-training
Prepare the data before pre-training using this script:
export BPE_DIR=/path/to/bpe
export TEXT_DIR=/path/to/text
export DATA_DIR=/path/to/data
python create_pretraining_data.py \
--input_file=$TEXT_DIR/thaiwikitext_sentseg \
--output_file=$DATA_DIR/tf_examples.tfrecord \
--vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5 \
--thai_text=True \
--spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
Then, run the following script to train a model from scratch:
export DATA_DIR=/path/to/data
export BERT_BASE_DIR=/path/to/bert_base
python run_pretraining.py \
--input_file=$DATA_DIR/tf_examples.tfrecord \
--output_dir=$BERT_BASE_DIR \
--do_train=True \
--do_eval=True \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=1000000 \
--num_warmup_steps=100000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=200000
The model has been trained for 1 million steps. On a Tesla K80 GPU, it took around 20 days to complete. However, a snapshot at 0.8 million steps is provided as it yields better results for downstream classification tasks.
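Since this model card targets the HuggingFace/Transformers library (see the introduction), the released checkpoint can also be loaded there once converted to the HuggingFace format; a minimal sketch, with placeholder paths rather than an official model name:
from transformers import BertConfig, BertModel

# Sketch: load weights already converted from TensorFlow to the HuggingFace
# format (for example with the transformers BERT TF-to-PyTorch conversion script).
model = BertModel.from_pretrained('/path/to/converted_bert_base_th')

# The raw TensorFlow checkpoint can also be loaded directly if TensorFlow is installed.
config = BertConfig.from_json_file('/path/to/bert_base/bert_config.json')
model_tf = BertModel.from_pretrained('/path/to/bert_base/model.ckpt.index',
                                     from_tf=True, config=config)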
Usage Examples
Downstream Classification Tasks
XNLI
XNLI is a dataset for evaluating a cross-lingual inferential classification task. The development and test sets contain 15 languages with thoroughly edited data, and machine-translated training data are also provided.
The Thai-only pre-trained BERT model can be applied to the XNLI task using training data translated into Thai. Spaces between words in the training data need to be removed to match the pre-training inputs. The processed XNLI files for Thai can be downloaded here.
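The space removal itself is mechanical; if you need to reproduce it, a minimal sketch (the tab-separated premise/hypothesis/label layout and the file names are assumptions, so adjust them to your translated XNLI files):
# Sketch: strip spaces from the Thai premise/hypothesis columns so that
# fine-tuning inputs match the space-free pre-training text.
def remove_spaces(line):
    premise, hypothesis, label = line.rstrip('\n').split('\t')
    return '\t'.join([premise.replace(' ', ''), hypothesis.replace(' ', ''), label])

with open('xnli.train.th.tsv', encoding='utf-8') as fin, \
     open('xnli.train.th.nospace.tsv', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(remove_spaces(line) + '\n')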
Then, run the following script to train on the XNLI task:
export BPE_DIR=/path/to/bpe
export XNLI_DIR=/path/to/xnli
export OUTPUT_DIR=/path/to/output
export BERT_BASE_DIR=/path/to/bert_base
python run_classifier.py \
--task_name=XNLI \
--do_train=true \
--do_eval=true \
--data_dir=$XNLI_DIR \
--vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--output_dir=$OUTPUT_DIR \
--xnli_language=th \
--spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
The following table compares the Thai-only model with the XNLI baselines and with the Multilingual Cased model trained on the same translated data:
XNLI Baseline (Translate Train) | XNLI Baseline (Translate Test) | BERT Multilingual Model | BERT Thai-only Model
62.8 | 64.4 | 66.1 | 68.9
Wongnai Review Dataset
The Wongnai Review Dataset collects restaurant reviews and ratings from the Wongnai website. The task is to classify a review into one of five ratings (1 to 5 stars). The dataset can be downloaded here, and the following script can be run to use the Thai-only model for this task:
export BPE_DIR=/path/to/bpe
export WONGNAI_DIR=/path/to/wongnai
export OUTPUT_DIR=/path/to/output
export BERT_BASE_DIR=/path/to/bert_base
python run_classifier.py \
--task_name=wongnai \
--do_train=true \
--do_predict=true \
--data_dir=$WONGNAI_DIR \
--vocab_file=$BPE_DIR/th.wiki.bpe.op25000.vocab \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=2.0 \
--output_dir=$OUTPUT_DIR \
--spm_file=$BPE_DIR/th.wiki.bpe.op25000.model
Without additional preprocessing or further fine-tuning, the Thai-only BERT model achieves scores of 0.56612 and 0.57057 on the public and private test sets, respectively.
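For submission, the class probabilities that --do_predict writes (test_results.tsv in the standard BERT run_classifier.py output format) still have to be turned into 1-5 star labels; a hedged sketch, assuming the probability columns correspond to ratings 1 through 5 in order:
import csv

# Sketch: convert run_classifier.py's test_results.tsv (one row of class
# probabilities per review) into star ratings by taking the argmax.
# The column-to-rating mapping is an assumption; verify it against the
# task's label list before submitting.
with open('test_results.tsv', encoding='utf-8') as f:
    probabilities = [list(map(float, row)) for row in csv.reader(f, delimiter='\t')]

ratings = [probs.index(max(probs)) + 1 for probs in probabilities]
print(ratings[:10])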