# BERT base Japanese (character-level tokenization with whole word masking, jawiki-20200831)
This is a pre-trained BERT model for Japanese text that uses word-level tokenization followed by character-level tokenization, with whole word masking.
## Quick Start
This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in the [unidic-lite](https://pypi.org/project/unidic-lite/) package), followed by character-level tokenization. Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective. The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v2.0).
## Features
- The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
- It uses word-level tokenization based on the Unidic 2.1.2 dictionary, followed by character-level tokenization.
- Whole word masking is enabled for the masked language modeling (MLM) objective.
## Installation
No installation steps are provided in the original README. The tokenization relies on the fugashi and unidic-lite packages described in the Tokenization section below.
## Usage Examples
No code examples are provided in the original README.
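A minimal usage sketch follows; it is not from the original README. It assumes the checkpoint is published on the Hugging Face Hub under the ID `cl-tohoku/bert-base-japanese-char-v2` (an assumption that may not match the actual repository name) and that `transformers`, `fugashi`, and `unidic-lite` are installed.

```python
# Hedged sketch: the Hub model ID below is an assumption, not stated in this README.
# Assumed prerequisites: pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "cl-tohoku/bert-base-japanese-char-v2"  # assumed Hub ID for this checkpoint

# The tokenizer performs the MeCab word-level step (via fugashi and unidic-lite)
# and then splits each word into characters, as described in the Tokenization section.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask demo: predict the character hidden by [MASK].
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("東北大学で[MASK]の研究をしています。"))
```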
## Documentation
### Model Architecture
The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
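As a hedged illustration (not part of the original README), the sizes stated in this card map onto Hugging Face's `BertConfig` as sketched below; values not mentioned here, such as the feed-forward size, are left at the library defaults.

```python
# Hedged sketch of the architecture parameters stated in this card.
from transformers import BertConfig

config = BertConfig(
    vocab_size=6144,              # character-level vocabulary size (see Tokenization)
    hidden_size=768,              # hidden state dimension
    num_hidden_layers=12,         # transformer layers
    num_attention_heads=12,       # attention heads
    max_position_embeddings=512,  # 512 tokens per training instance (see Training)
)
print(config)
```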
### Training Data

| Property | Details |
|----------|---------|
| Model Type | BERT base Japanese (character-level tokenization with whole word masking, jawiki-20200831) |
| Training Data | The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, containing approximately 30M sentences. The MeCab morphological parser with the [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) dictionary was used to split texts into sentences. |
### Tokenization
The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 6144. The fugashi and [unidic-lite](https://github.com/polm/unidic-lite) packages were used for the tokenization.
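The two-stage tokenization can be sketched roughly as follows. This is an illustration, not the actual pretraining code; the example sentence is made up, and it assumes fugashi picks up the unidic-lite dictionary installed via pip.

```python
# Hedged sketch of the two-stage tokenization: a MeCab word-level split (via fugashi
# with the unidic-lite dictionary), followed by a character-level split of each word.
# Assumed prerequisites: pip install fugashi unidic-lite
from fugashi import Tagger

tagger = Tagger()  # uses the unidic-lite dictionary when it is installed

text = "東北大学で自然言語処理を研究しています。"
words = [word.surface for word in tagger(text)]  # word-level tokens
chars = [list(word) for word in words]           # character-level tokens per word

print(words)  # e.g. ['東北', '大学', 'で', ...]
print(chars)  # e.g. [['東', '北'], ['大', '学'], ['で'], ...]
```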
### Training
The models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps. For the masked language modeling (MLM) objective, whole word masking is used, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once. Each model was trained on a Cloud TPU v3-8 instance provided by the TensorFlow Research Cloud program; training took about 5 days to finish.
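A toy illustration of whole word masking over character-level tokens is given below; it is not taken from the pretraining code. All characters of one randomly chosen MeCab word are masked together rather than independently.

```python
# Hedged toy example of whole word masking over character-level tokens:
# every character of one MeCab-segmented word is masked at once.
import random

words = ["東北", "大学", "で", "研究", "する"]  # illustrative word-level segmentation
char_tokens = [list(w) for w in words]          # character-level tokens per word

target = random.randrange(len(words))           # word chosen for masking
masked = [
    ["[MASK]"] * len(chars) if i == target else chars
    for i, chars in enumerate(char_tokens)
]
print([tok for group in masked for tok in group])
# e.g. ['東', '北', '[MASK]', '[MASK]', 'で', '研', '究', 'す', 'る']
```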
## Technical Details
The model combines word-level and character-level tokenization. Word-level tokenization based on the Unidic 2.1.2 dictionary helps the model identify semantic units in Japanese, while the subsequent character-level tokenization breaks words down into characters, capturing more fine-grained information. Whole word masking during MLM training lets the model learn the context of entire words, improving its language understanding.
## License
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.
## Acknowledgments
This model was trained with Cloud TPUs provided by the TensorFlow Research Cloud program.