
Bert Base Japanese Char Whole Word Masking

Developed by tohoku-nlp
A BERT model pre-trained on Japanese text using character-level tokenization and whole word masking techniques, suitable for Japanese natural language processing tasks.
Downloads 1,724
Release Time: 3/2/2022

Model Overview

This is a BERT model pre-trained on Japanese Wikipedia text, utilizing character-level tokenization and whole word masking techniques. It is suitable for various Japanese natural language processing tasks such as text classification and named entity recognition.
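The sketch below shows one way to query the model's masked-language-modeling head through the Hugging Face transformers pipeline API. The model identifier cl-tohoku/bert-base-japanese-char-whole-word-masking is an assumption about how the checkpoint is published, and the tokenizer additionally requires the fugashi and ipadic packages for the MeCab segmentation step.

```python
# Minimal masked-language-modeling sketch using the transformers pipeline API.
# The model identifier below is assumed; fugashi and ipadic must be installed
# for the MeCab-based word segmentation used by the tokenizer.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="cl-tohoku/bert-base-japanese-char-whole-word-masking",
)

# Because tokenization is character-level, [MASK] hides a single character.
text = "東京は日本の首[MASK]です。"  # "Tokyo is the capital of Japan."
for prediction in fill_mask(text, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```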

Model Features

Character-level Tokenization
Text is first segmented into words with the MeCab morphological analyzer using the IPA dictionary, then split into individual characters, which keeps the vocabulary compact and reduces out-of-vocabulary issues in Japanese text (a tokenization sketch follows this list).
Whole Word Masking Technique
During masked language modeling (MLM) pre-training, when a word segmented by MeCab is chosen for masking, all of the character tokens belonging to that word are masked together rather than independently, so the model must predict the whole word from context.
Wikipedia-based Pre-training
The training corpus is derived from the September 1, 2019 snapshot of Japanese Wikipedia, containing approximately 17 million sentences and totaling about 2.6 GB of text.
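As a minimal illustration of the character-level tokenization described above, the sketch below loads the tokenizer and prints the tokens for a short sentence; the model identifier and the fugashi/ipadic dependency are the same assumptions as in the previous example.

```python
# Sketch of the character-level tokenization: MeCab first splits the sentence
# into words, then each word is broken into single characters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-char-whole-word-masking"
)

tokens = tokenizer.tokenize("自然言語処理はとても面白い。")
print(tokens)
# Illustrative output: ['自', '然', '言', '語', '処', '理', 'は', ...]
```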

Model Capabilities

Japanese Text Understanding
Masked Language Modeling
Text Classification
Named Entity Recognition
Question Answering Systems

Use Cases

Natural Language Processing
Japanese Text Classification
Can be used to classify Japanese text, such as news categorization and sentiment analysis (a fine-tuning sketch follows this list).
Named Entity Recognition
Can be used to identify named entities in Japanese text, such as person names, locations, and organization names.
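The following sketch outlines how the checkpoint could be adapted to a downstream classification task such as sentiment analysis. The two-label head and the example sentence are purely illustrative assumptions; in practice the classification head would need to be fine-tuned on labeled data before its predictions are meaningful.

```python
# Hedged sketch of adapting the pre-trained encoder to text classification.
# The binary label setup is hypothetical; the classification head is randomly
# initialized and must be fine-tuned before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese-char-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2  # e.g. positive / negative sentiment
)

inputs = tokenizer("この映画はとても面白かった。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (untrained head, roughly uniform)
```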