Bert Large Japanese Char
Developed by tohoku-nlp
A BERT model pretrained on Japanese Wikipedia that uses character-level tokenization and whole word masking, suitable for Japanese natural language processing tasks.
Downloads 24
Release Time: 3/2/2022
Model Overview
This model is a BERT variant optimized for Japanese text. It combines word-level and character-level tokenization and performs well on masked language modeling tasks.
Model Features
Hybrid Tokenization Strategy
Text is first segmented into words with MeCab and the Unidic dictionary, then each word is split into characters, balancing word-level information with fine-grained processing (see the tokenization sketch after this list)
Whole Word Masking Training
During pretraining, all character tokens belonging to the same word are masked simultaneously, strengthening the model's understanding of complete words
Large-scale Pretraining
Trained for 1 million steps on a 4.0 GB Japanese Wikipedia corpus (about 30 million sentences)
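
The hybrid tokenization can be inspected directly with the Hugging Face transformers library. The sketch below is a minimal, non-authoritative example: the repository ID "tohoku-nlp/bert-large-japanese-char" and the sample sentence are assumptions, and the MeCab segmentation step requires the fugashi and unidic-lite packages.

```python
# Minimal sketch: inspect the hybrid word-then-character tokenization.
# Assumptions: the model is available on the Hugging Face Hub as
# "tohoku-nlp/bert-large-japanese-char" (repository ID not confirmed here),
# and `transformers`, `fugashi`, and `unidic-lite` are installed.
from transformers import AutoTokenizer

MODEL_ID = "tohoku-nlp/bert-large-japanese-char"  # assumed Hub repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

text = "東北大学で自然言語処理の研究をしています。"

# The tokenizer first segments the sentence into words with MeCab + Unidic,
# then splits each word into single characters, so the printed tokens are
# character-level pieces (e.g. "東", "北", "大", "学", ...).
print(tokenizer.tokenize(text))

# Encoding adds the usual [CLS] and [SEP] special tokens around the character IDs.
print(tokenizer.encode(text))
```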
Model Capabilities
Japanese Text Understanding
Masked Word Prediction
Contextual Representation Learning
Use Cases
Natural Language Processing
Text Infilling
Predict masked tokens in text, e.g., 'Engaging in [MASK] research at Tohoku University' (see the pipeline sketch below)
Downstream Task Fine-tuning
Can serve as a base model for fine-tuning on NLP tasks such as text classification and named entity recognition
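
For the text infilling use case above, a short fill-mask sketch follows. It is illustrative only: the repository ID and the Japanese rendering of the example sentence are assumptions, and because tokenization is character-level, each [MASK] stands for a single character.

```python
# Minimal sketch of masked token prediction with the `fill-mask` pipeline.
# Assumptions: "tohoku-nlp/bert-large-japanese-char" is the Hub repository ID, and
# the sentence is one possible Japanese rendering of the example in this model card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-large-japanese-char")

# Tokenization is character-level, so [MASK] is filled with a single character.
for prediction in fill_mask("東北大学で[MASK]の研究をしています。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```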