# BERT base Japanese (character-level tokenization with whole word masking, jawiki-20200831)
This is a pre-trained BERT model for Japanese text that uses word-level tokenization followed by character-level tokenization, with whole word masking.
## Quick Start
This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in the [unidic-lite](https://pypi.org/project/unidic-lite/) package), followed by character-level tokenization. Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective. The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v2.0).
## Features
- The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
- It uses word-level tokenization based on the Unidic 2.1.2 dictionary, followed by character-level tokenization.
- Whole word masking is enabled for the masked language modeling (MLM) objective.
## Installation
No installation steps are provided in the original README. The tokenization relies on the fugashi and unidic-lite packages described in the Tokenization section below.
## Usage Examples
No code examples are provided in the original README.
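A minimal usage sketch follows; it is not from the original README. It assumes the checkpoint is published on the Hugging Face Hub under the ID `cl-tohoku/bert-base-japanese-char-v2` (an assumption that may not match the actual repository name) and that `transformers`, `fugashi`, and `unidic-lite` are installed.

```python
# Hedged sketch: the Hub model ID below is an assumption, not stated in this README.
# Assumed prerequisites: pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "cl-tohoku/bert-base-japanese-char-v2"  # assumed Hub ID for this checkpoint

# The tokenizer performs the MeCab word-level step (via fugashi and unidic-lite)
# and then splits each word into characters, as described in the Tokenization section.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask demo: predict the character hidden by [MASK].
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("東北大学で[MASK]の研究をしています。"))
```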
## Documentation
### Model Architecture
The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
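As a hedged illustration (not part of the original README), the sizes stated in this card map onto Hugging Face's `BertConfig` as sketched below; values not mentioned here, such as the feed-forward size, are left at the library defaults.

```python
# Hedged sketch of the architecture parameters stated in this card.
from transformers import BertConfig

config = BertConfig(
    vocab_size=6144,              # character-level vocabulary size (see Tokenization)
    hidden_size=768,              # hidden state dimension
    num_hidden_layers=12,         # transformer layers
    num_attention_heads=12,       # attention heads
    max_position_embeddings=512,  # 512 tokens per training instance (see Training)
)
print(config)
```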
### Training Data

| Property | Details |
|----------|---------|
| Model Type | BERT base Japanese (character-level tokenization with whole word masking, jawiki-20200831) |
| Training Data | The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, containing approximately 30M sentences. The MeCab morphological parser with the [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) dictionary was used to split texts into sentences. |
### Tokenization
The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into characters. The vocabulary size is 6144. The fugashi and [unidic-lite](https://github.com/polm/unidic-lite) packages were used for the tokenization.
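The two-stage tokenization can be sketched roughly as follows. This is an illustration, not the actual pretraining code; the example sentence is made up, and it assumes fugashi picks up the unidic-lite dictionary installed via pip.

```python
# Hedged sketch of the two-stage tokenization: a MeCab word-level split (via fugashi
# with the unidic-lite dictionary), followed by a character-level split of each word.
# Assumed prerequisites: pip install fugashi unidic-lite
from fugashi import Tagger

tagger = Tagger()  # uses the unidic-lite dictionary when it is installed

text = "東北大学で自然言語処理を研究しています。"
words = [word.surface for word in tagger(text)]  # word-level tokens
chars = [list(word) for word in words]           # character-level tokens per word

print(words)  # e.g. ['東北', '大学', 'で', ...]
print(chars)  # e.g. [['東', '北'], ['大', '学'], ['で'], ...]
```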
### Training
The models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps. For the masked language modeling (MLM) objective, whole word masking is used, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once. Each model was trained on a Cloud TPU v3-8 instance provided by the TensorFlow Research Cloud program; training took about 5 days to finish.
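A toy illustration of whole word masking over character-level tokens is given below; it is not taken from the pretraining code. All characters of one randomly chosen MeCab word are masked together rather than independently.

```python
# Hedged toy example of whole word masking over character-level tokens:
# every character of one MeCab-segmented word is masked at once.
import random

words = ["東北", "大学", "で", "研究", "する"]  # illustrative word-level segmentation
char_tokens = [list(w) for w in words]          # character-level tokens per word

target = random.randrange(len(words))           # word chosen for masking
masked = [
    ["[MASK]"] * len(chars) if i == target else chars
    for i, chars in enumerate(char_tokens)
]
print([tok for group in masked for tok in group])
# e.g. ['東', '北', '[MASK]', '[MASK]', 'で', '研', '究', 'す', 'る']
```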
## Technical Details
The model combines word-level and character-level tokenization. Word-level tokenization based on the Unidic 2.1.2 dictionary helps the model identify semantic units in Japanese, while the subsequent character-level tokenization breaks words down into characters, capturing more fine-grained information. Whole word masking during MLM training lets the model learn the context of entire words, improving its language understanding.
## License
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.
## Acknowledgments
This model was trained with Cloud TPUs provided by the TensorFlow Research Cloud program.