BERT base Japanese (character tokenization)
This is a BERT model pretrained on Japanese texts. It processes input texts with word-level tokenization based on the IPA dictionary and then character-level tokenization.
🚀 Quick Start
This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization. The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
✨ Features
- The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
- The texts are first tokenized by the MeCab morphological parser with the IPA dictionary and then split into characters. The vocabulary size is 4000.
- The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
📦 Installation
The original README does not include installation steps.
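In practice, the pretrained model is typically used through the Hugging Face Transformers library together with MeCab bindings for the IPA dictionary, for example `pip install transformers fugashi ipadic`; this package set is an assumption rather than an instruction from the original README, and the required packages may differ across Transformers versions.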
💻 Usage Examples
The original README does not include code examples.
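As a minimal sketch (not taken from the original README), the model can be loaded with the Hugging Face Transformers library. The model identifier `cl-tohoku/bert-base-japanese-char` and the `fugashi`/`ipadic` dependencies are assumptions based on the publicly released checkpoints.

```python
# Minimal usage sketch; not from the original README.
# Assumed dependencies: pip install torch transformers fugashi ipadic
# The model id "cl-tohoku/bert-base-japanese-char" is an assumption based on the released checkpoints.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "cl-tohoku/bert-base-japanese-char"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Encode a Japanese sentence; the tokenizer applies MeCab word segmentation
# followed by character-level splitting.
text = "東北大学で自然言語処理を研究しています。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Logits over the character-level vocabulary for each input position.
print(outputs.logits.shape)  # e.g. torch.Size([1, sequence_length, 4000])
```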
📚 Documentation
Model architecture
The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
Training Data
The model is trained on Japanese Wikipedia as of September 1, 2019. To generate the training corpus, WikiExtractor is used to extract plain texts from a dump file of Wikipedia articles. The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.
Tokenization
The texts are first tokenized by the MeCab morphological parser with the IPA dictionary and then split into characters. The vocabulary size is 4000.
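To illustrate the two-stage tokenization, the released tokenizer can be inspected as follows. This sketch is not from the original README, and the model identifier is an assumption.

```python
# Illustration of the two-stage tokenization: MeCab word segmentation with the
# IPA dictionary, followed by splitting each word into characters.
# Assumed dependencies: pip install transformers fugashi ipadic
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

tokens = tokenizer.tokenize("日本語の文章を解析する。")
print(tokens)
# Expected: single-character tokens, e.g. ['日', '本', '語', 'の', '文', '章', ...]
```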
Training
The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
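For reference, the architecture and sequence-length settings described above can be expressed as a Transformers `BertConfig`. This is an illustrative sketch, not the authors' training configuration; the batch size and step count are training-time settings and appear only as comments.

```python
# Illustrative configuration matching the numbers reported above;
# not the authors' actual training script or config file.
from transformers import BertConfig

config = BertConfig(
    vocab_size=4000,              # character-level vocabulary
    hidden_size=768,              # dimension of hidden states
    num_hidden_layers=12,         # Transformer layers
    num_attention_heads=12,       # attention heads
    max_position_embeddings=512,  # 512 tokens per training instance
)

# Training-time settings from the README (not stored in BertConfig):
# 256 instances per batch, 1M training steps.
print(config)
```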
🔧 Technical Details
The model combines word-level tokenization, performed by the MeCab morphological parser with the IPA dictionary, with character-level tokenization. It follows the original BERT architecture and training configuration, which allows it to learn effectively from Japanese texts. Using Wikipedia as the training corpus provides a large-scale and diverse body of Japanese text for the model to learn from.
📄 License
The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/).
Acknowledgments
For training models, we used Cloud TPUs provided by the TensorFlow Research Cloud program.
Information Table
| Property | Details |
|----------|---------|
| Model Type | BERT base Japanese (character tokenization) |
| Training Data | Japanese Wikipedia as of September 1, 2019, extracted with WikiExtractor; 2.6GB of text, approximately 17M sentences |
| License | [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |