BERT Base Japanese
Developed by tohoku-nlp
A BERT model pretrained on Japanese Wikipedia text that uses the MeCab morphological analyzer with the IPA dictionary for word-level tokenization, suitable for Japanese natural language processing tasks.
Downloads 153.44k
Release Time: 3/2/2022
Model Overview
This is a BERT model pretrained on Japanese text. Input is first segmented into words with the MeCab morphological analyzer and the IPA dictionary, then split into WordPiece subwords, making the model suitable for a wide range of Japanese natural language understanding tasks.
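As a quick orientation, the following minimal sketch loads the model with the Hugging Face transformers library and runs a fill-mask prediction. It assumes the model is hosted on the Hub under the ID tohoku-nlp/bert-base-japanese (formerly cl-tohoku/bert-base-japanese) and that the MeCab bindings (fugashi) and the IPA dictionary package (ipadic) are installed.

```python
# pip install transformers fugashi ipadic  (MeCab bindings and the IPA dictionary are required)
from transformers import BertJapaneseTokenizer, BertForMaskedLM, pipeline

MODEL_ID = "tohoku-nlp/bert-base-japanese"  # assumed Hub ID (formerly cl-tohoku/bert-base-japanese)

tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_ID)
model = BertForMaskedLM.from_pretrained(MODEL_ID)

# Predict the masked token in a Japanese sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for candidate in fill_mask("東京は日本の[MASK]です。"):
    print(candidate["token_str"], candidate["score"])
```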
Model Features
Japanese-Specific Tokenization
Uses the MeCab morphological analyzer with the IPA dictionary to segment text into words before WordPiece subword splitting, so tokenization respects Japanese word boundaries (a short sketch follows after this feature list).
Large-Scale Pretraining
Trained on a 2.6 GB Japanese Wikipedia corpus containing approximately 17 million sentences.
Standard BERT Architecture
Adopts the same architecture and training hyperparameters as the original BERT base model (12 layers, 768 hidden units, 12 attention heads), so existing BERT tooling and workflows apply unchanged.
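A minimal sketch of what the MeCab-plus-WordPiece pipeline produces, assuming the Hub ID tohoku-nlp/bert-base-japanese and that fugashi and ipadic are installed:

```python
from transformers import BertJapaneseTokenizer

# Assumed Hub ID; fugashi and ipadic are needed for the MeCab step.
tokenizer = BertJapaneseTokenizer.from_pretrained("tohoku-nlp/bert-base-japanese")

# Text is first segmented into words by MeCab (IPA dictionary),
# then each word is split into WordPiece subwords.
print(tokenizer.tokenize("自然言語処理はとても面白いです。"))
# Example output (actual result may differ): ['自然', '言語', '処理', 'は', 'とても', '面白い', 'です', '。']
```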
Model Capabilities
Japanese Text Understanding
Japanese Text Classification
Japanese Question Answering
Japanese Named Entity Recognition
Japanese Semantic Similarity Calculation
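For the semantic similarity capability, a common approach (not prescribed by the model card) is to mean-pool the final hidden states into sentence embeddings and compare them with cosine similarity. A minimal sketch, assuming the Hub ID tohoku-nlp/bert-base-japanese:

```python
import torch
from transformers import BertJapaneseTokenizer, BertModel

MODEL_ID = "tohoku-nlp/bert-base-japanese"  # assumed Hub ID
tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_ID)
model = BertModel.from_pretrained(MODEL_ID)
model.eval()

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden states over non-padding tokens.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # shape (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a = embed("今日は天気が良いです。")
b = embed("本日は晴天です。")
print(torch.nn.functional.cosine_similarity(a, b).item())
```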
Use Cases
Text Analysis
Japanese Sentiment Analysis
Analyze the sentiment polarity of Japanese text
Japanese Text Classification
Classify Japanese documents
Information Extraction
Japanese Named Entity Recognition
Extract entities such as person names and locations from Japanese text
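The pretrained checkpoint carries no entity labels of its own, so named entity recognition requires fine-tuning with a token-classification head. A minimal setup sketch, using an illustrative tag set and the assumed Hub ID tohoku-nlp/bert-base-japanese:

```python
from transformers import BertJapaneseTokenizer, AutoModelForTokenClassification

MODEL_ID = "tohoku-nlp/bert-base-japanese"  # assumed Hub ID
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # illustrative tag set, not from the model card

tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is randomly initialized here; it must be fine-tuned
# on a labeled Japanese NER corpus (e.g. with the transformers Trainer) before use.
```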