
Bert Base Japanese Char Whole Word Masking

Developed by tohoku-nlp
A BERT model pre-trained on Japanese text using character-level tokenization and whole word masking techniques, suitable for Japanese natural language processing tasks.
Downloads 1,724
Release Time: 3/2/2022

Model Overview

This is a BERT model pre-trained on Japanese Wikipedia text, utilizing character-level tokenization and whole word masking techniques. It is suitable for various Japanese natural language processing tasks such as text classification and named entity recognition.
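The sketch below shows one way to query the model's masked-language-modeling head through the Hugging Face transformers pipeline API. The model identifier cl-tohoku/bert-base-japanese-char-whole-word-masking is an assumption about how the checkpoint is published, and the tokenizer additionally requires the fugashi and ipadic packages for the MeCab segmentation step.

```python
# Minimal masked-language-modeling sketch using the transformers pipeline API.
# The model identifier below is assumed; fugashi and ipadic must be installed
# for the MeCab-based word segmentation used by the tokenizer.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="cl-tohoku/bert-base-japanese-char-whole-word-masking",
)

# Because tokenization is character-level, [MASK] hides a single character.
text = "東京は日本の首[MASK]です。"  # "Tokyo is the capital of Japan."
for prediction in fill_mask(text, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```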

Model Features

Character-level Tokenization
Text is first segmented into words with the MeCab morphological analyzer using the IPA dictionary, then split into individual characters, which keeps the vocabulary compact and reduces out-of-vocabulary issues in Japanese text (a tokenization sketch follows this list).
Whole Word Masking Technique
During masked language modeling (MLM) pre-training, when a word segmented by MeCab is chosen for masking, all of the character tokens belonging to that word are masked together rather than independently, so the model must predict the whole word from context.
Wikipedia-based Pre-training
The training corpus is derived from the September 1, 2019 snapshot of Japanese Wikipedia, containing approximately 17 million sentences and totaling about 2.6 GB of text.
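As a minimal illustration of the character-level tokenization described above, the sketch below loads the tokenizer and prints the tokens for a short sentence; the model identifier and the fugashi/ipadic dependency are the same assumptions as in the previous example.

```python
# Sketch of the character-level tokenization: MeCab first splits the sentence
# into words, then each word is broken into single characters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "cl-tohoku/bert-base-japanese-char-whole-word-masking"
)

tokens = tokenizer.tokenize("自然言語処理はとても面白い。")
print(tokens)
# Illustrative output: ['自', '然', '言', '語', '処', '理', 'は', ...]
```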

Model Capabilities

Japanese Text Understanding
Masked Language Modeling
Text Classification
Named Entity Recognition
Question Answering Systems

Use Cases

Natural Language Processing
Japanese Text Classification
Can be used to classify Japanese text, such as news categorization and sentiment analysis (a fine-tuning sketch follows this list).
Named Entity Recognition
Can be used to identify named entities in Japanese text, such as person names, locations, and organization names.
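The following sketch outlines how the checkpoint could be adapted to a downstream classification task such as sentiment analysis. The two-label head and the example sentence are purely illustrative assumptions; in practice the classification head would need to be fine-tuned on labeled data before its predictions are meaningful.

```python
# Hedged sketch of adapting the pre-trained encoder to text classification.
# The binary label setup is hypothetical; the classification head is randomly
# initialized and must be fine-tuned before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "cl-tohoku/bert-base-japanese-char-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2  # e.g. positive / negative sentiment
)

inputs = tokenizer("この映画はとても面白かった。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (untrained head, roughly uniform)
```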