
Bert Large Japanese Char

Developed by tohoku-nlp
A BERT model pretrained on Japanese Wikipedia that uses character-level tokenization and whole word masking, suited to Japanese natural language processing tasks.
Downloads: 24
Release Time: 3/2/2022

Model Overview

This model is a BERT variant optimized for Japanese text. It combines word-level and character-level tokenization and performs well on masked language modeling tasks.

Model Features

Hybrid Tokenization Strategy
Text is first tokenized into words with MeCab using the Unidic dictionary, then split into individual characters, balancing word-boundary information with fine-grained processing (see the tokenization sketch after this list)
Whole Word Masking Training
All subword tokens of the same word are masked simultaneously, enhancing the model's understanding of complete words
Large-scale Pretraining
Trained for 1 million steps on a 4.0 GB Japanese Wikipedia corpus (about 30 million sentences)
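
The following is a minimal sketch of how the hybrid tokenization can be inspected with Hugging Face Transformers. The model ID tohoku-nlp/bert-large-japanese-char and the fugashi/unidic-lite dependencies are assumptions not stated on this page.

```python
# Sketch: inspect the character-level tokenization of this model.
# Assumes the Hugging Face model ID "tohoku-nlp/bert-large-japanese-char"
# and that fugashi + unidic-lite are installed for the MeCab word-level step.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese-char")

text = "東北大学で自然言語処理の研究をしています。"
tokens = tokenizer.tokenize(text)

# Words are first segmented with MeCab, then split into characters,
# so the output is one token per character, e.g. ['東', '北', '大', '学', ...]
print(tokens)
```
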

Model Capabilities

Japanese Text Understanding
Masked Word Prediction
Contextual Representation Learning

Use Cases

Natural Language Processing
Text Infilling
Predict masked words in text, e.g. 'Engaging in [MASK] research at Tohoku University' (see the fill-mask sketch after this list)
Downstream Task Fine-tuning
Can serve as a baseline model for NLP tasks such as text classification and named entity recognition (a classification sketch follows below)
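
A minimal sketch of masked word prediction with this checkpoint, assuming the Hugging Face model ID tohoku-nlp/bert-large-japanese-char; the example sentence is the Japanese form of the prompt above.

```python
# Sketch: masked word prediction with the fill-mask pipeline.
# Because tokenization is character-level, each [MASK] corresponds to a
# single character in the prediction.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-large-japanese-char")

# "Engaging in [MASK] research at Tohoku University"
for prediction in fill_mask("東北大学で[MASK]の研究をしています。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```
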
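For downstream fine-tuning, the checkpoint can be loaded with a task head. The sketch below shows text classification; the model ID, label count, and example sentences are illustrative assumptions, and a real run would add a training loop or Trainer.

```python
# Sketch: load the checkpoint as a baseline for text classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "tohoku-nlp/bert-large-japanese-char"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode a small batch of Japanese sentences and run a forward pass;
# fine-tuning would wrap this in a standard training loop.
batch = tokenizer(
    ["この映画は面白かった。", "退屈な映画だった。"],
    padding=True,
    return_tensors="pt",
)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```
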