BERT large Japanese (unidic-lite with whole word masking, jawiki-20200831)
This is a pretrained BERT model for Japanese texts.
Quick Start
This is a BERT model pretrained on texts in the Japanese language. This version of the model processes input texts with word-level tokenization based on the Unidic 2.1.2 dictionary (available in the [unidic-lite](https://pypi.org/project/unidic-lite/) package), followed by WordPiece subword tokenization. Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective. The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v2.0).
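As a quick usage sketch, the snippet below runs masked word prediction with the Hugging Face transformers library. The model ID `cl-tohoku/bert-large-japanese` is an assumption based on the model name (check the hosting page for the exact ID), and the fugashi and unidic-lite packages must be installed for the Japanese tokenizer.

```python
# Minimal fill-mask sketch with Hugging Face transformers.
# Assumptions: the checkpoint is published under the ID "cl-tohoku/bert-large-japanese"
# (assumed here), and fugashi + unidic-lite are installed for the Japanese tokenizer.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "cl-tohoku/bert-large-japanese"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# "東北大学で[MASK]の研究をしています。" -- "I do research on [MASK] at Tohoku University."
for prediction in fill_mask(f"東北大学で{tokenizer.mask_token}の研究をしています。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```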
Features
- The model architecture is the same as the original BERT large model; 24 layers, 1024 dimensions of hidden states, and 16 attention heads.
- The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768.
- For training of the MLM (masked language modeling) objective, whole word masking is introduced, where all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once.
Documentation
Model architecture
The model architecture is the same as the original BERT large model; 24 layers, 1024 dimensions of hidden states, and 16 attention heads.
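For reference, the sketch below restates this architecture as a transformers `BertConfig`. The layer, hidden-size, head, and vocabulary numbers come from this README; the feed-forward size of 4096 is the standard BERT-large value and is an assumption, since the README does not state it.

```python
# Sketch of the described architecture as a transformers BertConfig.
# Values restate the README (24 layers, 1024 hidden size, 16 heads, 32768 vocabulary);
# intermediate_size=4096 is the standard BERT-large value, assumed here.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=32768,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=512,  # 512 tokens per instance (see Training)
)
model = BertForMaskedLM(config)  # randomly initialized, architecture only
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```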
Training Data
The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, containing approximately 30M sentences. We used the MeCab morphological parser with the [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd) dictionary to split texts into sentences.
Tokenization
The texts are first tokenized by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768. We used the fugashi and [unidic-lite](https://github.com/polm/unidic-lite) packages for the tokenization.
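The sketch below illustrates the two tokenization steps separately: MeCab word segmentation via fugashi, then WordPiece subword splitting with the pretrained tokenizer (which performs both steps internally). It assumes fugashi and unidic-lite are installed and reuses the assumed model ID from the Quick Start sketch.

```python
# Two-step tokenization sketch: MeCab word segmentation (via fugashi), then
# WordPiece subword splitting with the pretrained tokenizer, which runs both
# steps internally. The model ID "cl-tohoku/bert-large-japanese" is assumed.
import fugashi
from transformers import AutoTokenizer

text = "自然言語処理はとても面白い。"  # "Natural language processing is very interesting."

# Step 1: word-level tokenization; fugashi uses an installed UniDic dictionary
# (unidic-lite is assumed to be the one installed, matching the setup above).
tagger = fugashi.Tagger()
print([word.surface for word in tagger(text)])

# Step 2: WordPiece subword tokenization over the 32768-entry vocabulary.
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-large-japanese")
print(tokenizer.tokenize(text))
```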
Training
The models are trained with the same configuration as the original BERT; 512 tokens per instance, 256 instances per batch, and 1M training steps. For training of the MLM (masked language modeling) objective, we introduced whole word masking in which all of the subword tokens corresponding to a single word (tokenized by MeCab) are masked at once. For training of each model, we used a v3-8 instance of Cloud TPUs provided by the TensorFlow Research Cloud program. The training took about 5 days to finish.
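The sketch below illustrates the whole word masking rule only, not the actual training code (which lives in the cl-tohoku/bert-japanese repository): WordPiece tokens starting with `##` continue the preceding word, so grouping on that prefix recovers the MeCab word boundaries, and a selected word is masked as a unit. For simplicity, each word is masked independently with a fixed probability; the real recipe also uses BERT's 80/10/10 replacement scheme and a per-sequence masking budget.

```python
import random

def whole_word_mask(tokens, word_mask_prob=0.15, mask_token="[MASK]"):
    """Illustrative whole word masking: a WordPiece token starting with '##'
    continues the previous word, so all pieces of a chosen word are masked
    together. Simplified sketch; not the cl-tohoku training code."""
    # Group token indices into words: a new word starts at any non-'##' token.
    word_spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and word_spans:
            word_spans[-1].append(i)
        else:
            word_spans.append([i])

    masked = list(tokens)
    for span in word_spans:
        if random.random() < word_mask_prob:
            for i in span:  # mask every subword of the selected word at once
                masked[i] = mask_token
    return masked

# "面白" and "##い" are subwords of the single word "面白い" and get masked together.
tokens = ["自然", "言語", "処理", "は", "面白", "##い", "。"]
print(whole_word_mask(tokens, word_mask_prob=0.3))
```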
License
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
Technical Details
| Property | Details |
|---|---|
| Model Type | BERT large Japanese (unidic-lite with whole word masking, jawiki-20200831) |
| Training Data | Japanese version of Wikipedia, generated from the Wikipedia Cirrussearch dump file as of August 31, 2020 (about 4.0GB, approximately 30M sentences) |
| Tokenization | MeCab with the Unidic 2.1.2 dictionary, then WordPiece subword tokenization; vocabulary size 32768; uses the fugashi and unidic-lite packages |
| Training Configuration | 512 tokens per instance, 256 instances per batch, 1M training steps, whole word masking for the MLM objective |
| Hardware | v3-8 instance of Cloud TPUs provided by the TensorFlow Research Cloud program |
| Training Time | About 5 days |
Acknowledgments
This model was trained with Cloud TPUs provided by the TensorFlow Research Cloud program.