
Bert Base Japanese Whole Word Masking

Developed by tohoku-nlp
A BERT model pretrained on Japanese text using IPA-dictionary tokenization and whole word masking
Downloads: 113.33k
Release Time: 3/2/2022

Model Overview

This is a BERT model pretrained on a Japanese Wikipedia corpus, intended primarily for Japanese natural language processing tasks. The model uses the IPA dictionary (via MeCab) for word-level tokenization and was pretrained with whole word masking.
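A minimal loading sketch is shown below. It assumes the checkpoint is available on the Hugging Face Hub under the ID tohoku-nlp/bert-base-japanese-whole-word-masking and that the MeCab bindings (fugashi) and the IPA dictionary package (ipadic) are installed; the exact model ID and dependency names are assumptions and may differ.

```python
# Sketch: load the tokenizer and encoder, assuming the Hub ID
# "tohoku-nlp/bert-base-japanese-whole-word-masking" and that the
# fugashi (MeCab bindings) and ipadic packages are installed.
from transformers import AutoTokenizer, AutoModel

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# MeCab with the IPA dictionary first splits the sentence into words;
# each word is then split into WordPiece subword tokens.
text = "東京でラーメンを食べました。"
print(tokenizer.tokenize(text))
```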

Model Features

IPA Dictionary Tokenization
Uses the MeCab tokenizer with the IPA dictionary for word-level segmentation, which better matches the characteristics of Japanese text
Whole Word Masking
Masks all subword tokens of a complete word simultaneously during pretraining, improving masked language modeling (a fill-mask sketch follows this list)
Large-scale Pretraining
Trained for 1 million steps on a 2.6GB Japanese Wikipedia corpus (approximately 17 million sentences)
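To illustrate the masked language modeling objective the model was pretrained with, the hedged sketch below uses the fill-mask pipeline; it reuses the same assumed Hub ID as above.

```python
# Sketch: masked-token prediction with the fill-mask pipeline,
# assuming the Hub ID "tohoku-nlp/bert-base-japanese-whole-word-masking".
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="tohoku-nlp/bert-base-japanese-whole-word-masking",
)

# The model predicts the word hidden behind [MASK].
for candidate in fill_mask("東京は日本の[MASK]です。"):
    print(candidate["token_str"], round(candidate["score"], 3))
```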

Model Capabilities

Japanese text understanding
Japanese language modeling
Japanese text feature extraction
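For the feature-extraction capability listed above, one common pattern is to take the final hidden state of the [CLS] token (or a mean over token states) as a sentence vector. The sketch below is illustrative only and reuses the assumed Hub ID.

```python
# Sketch: extract a sentence embedding from the encoder output,
# assuming the Hub ID "tohoku-nlp/bert-base-japanese-whole-word-masking".
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("自然言語処理は面白い。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's final hidden state as a 768-dimensional feature vector.
sentence_vector = outputs.last_hidden_state[:, 0, :]
print(sentence_vector.shape)  # torch.Size([1, 768])
```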

Use Cases

Natural Language Processing
Japanese Text Classification
Can be used for tasks such as news categorization and sentiment analysis (see the classification sketch after this section)
Japanese QA Systems
Serves as a base model for building Japanese question answering applications
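As a sketch of the text classification use case, the snippet below attaches a classification head to the pretrained encoder. The Hub ID and the label count are illustrative assumptions, and the head is randomly initialized, so it must still be fine-tuned on labeled data before its predictions are meaningful.

```python
# Sketch: set up a Japanese text classifier on top of the pretrained encoder.
# The Hub ID and num_labels=3 (e.g. negative/neutral/positive) are assumptions;
# the classification head is randomly initialized and requires fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

inputs = tokenizer("この映画は最高だった。", return_tensors="pt")
logits = model(**inputs).logits  # untrained head: logits are arbitrary until fine-tuned
print(logits.shape)  # torch.Size([1, 3])
```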