
Bert Base Japanese Whole Word Masking

Developed by tohoku-nlp
A BERT model pretrained on Japanese text using IPA-dictionary tokenization and whole word masking
Downloads: 113.33k
Release Time: 3/2/2022

Model Overview

This is a BERT model pretrained on a Japanese Wikipedia corpus, intended primarily for Japanese natural language processing tasks. The model uses the IPA dictionary (via MeCab) for word-level tokenization and was pretrained with whole word masking.
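A minimal loading sketch is shown below. It assumes the checkpoint is available on the Hugging Face Hub under the ID tohoku-nlp/bert-base-japanese-whole-word-masking and that the MeCab bindings (fugashi) and the IPA dictionary package (ipadic) are installed; the exact model ID and dependency names are assumptions and may differ.

```python
# Sketch: load the tokenizer and encoder, assuming the Hub ID
# "tohoku-nlp/bert-base-japanese-whole-word-masking" and that the
# fugashi (MeCab bindings) and ipadic packages are installed.
from transformers import AutoTokenizer, AutoModel

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# MeCab with the IPA dictionary first splits the sentence into words;
# each word is then split into WordPiece subword tokens.
text = "東京でラーメンを食べました。"
print(tokenizer.tokenize(text))
```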

Model Features

IPA Dictionary Tokenization
Uses the MeCab tokenizer with the IPA dictionary for word-level segmentation, which better matches the characteristics of Japanese text
Whole Word Masking
Masks all subword tokens of a complete word simultaneously during pretraining, improving masked language modeling (a fill-mask sketch follows this list)
Large-scale Pretraining
Trained for 1 million steps on a 2.6GB Japanese Wikipedia corpus (approximately 17 million sentences)
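To illustrate the masked language modeling objective the model was pretrained with, the hedged sketch below uses the fill-mask pipeline; it reuses the same assumed Hub ID as above.

```python
# Sketch: masked-token prediction with the fill-mask pipeline,
# assuming the Hub ID "tohoku-nlp/bert-base-japanese-whole-word-masking".
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="tohoku-nlp/bert-base-japanese-whole-word-masking",
)

# The model predicts the word hidden behind [MASK].
for candidate in fill_mask("東京は日本の[MASK]です。"):
    print(candidate["token_str"], round(candidate["score"], 3))
```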

Model Capabilities

Japanese text understanding
Japanese language modeling
Japanese text feature extraction
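For the feature-extraction capability listed above, one common pattern is to take the final hidden state of the [CLS] token (or a mean over token states) as a sentence vector. The sketch below is illustrative only and reuses the assumed Hub ID.

```python
# Sketch: extract a sentence embedding from the encoder output,
# assuming the Hub ID "tohoku-nlp/bert-base-japanese-whole-word-masking".
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("自然言語処理は面白い。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's final hidden state as a 768-dimensional feature vector.
sentence_vector = outputs.last_hidden_state[:, 0, :]
print(sentence_vector.shape)  # torch.Size([1, 768])
```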

Use Cases

Natural Language Processing
Japanese Text Classification
Can be used for tasks such as news categorization and sentiment analysis (see the classification sketch after this section)
Japanese QA Systems
Serves as a base model for building Japanese question answering applications
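As a sketch of the text classification use case, the snippet below attaches a classification head to the pretrained encoder. The Hub ID and the label count are illustrative assumptions, and the head is randomly initialized, so it must still be fine-tuned on labeled data before its predictions are meaningful.

```python
# Sketch: set up a Japanese text classifier on top of the pretrained encoder.
# The Hub ID and num_labels=3 (e.g. negative/neutral/positive) are assumptions;
# the classification head is randomly initialized and requires fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

inputs = tokenizer("この映画は最高だった。", return_tensors="pt")
logits = model(**inputs).logits  # untrained head: logits are arbitrary until fine-tuned
print(logits.shape)  # torch.Size([1, 3])
```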