
Bert Large Japanese Char

Developed by tohoku-nlp
A BERT model pretrained on Japanese Wikipedia that uses character-level tokenization and whole word masking, suited to Japanese natural language processing tasks.
Downloads: 24
Release Time: 3/2/2022

Model Overview

This model is a BERT variant optimized for Japanese text. It combines word-level and character-level tokenization and performs well on masked language modeling tasks.

Model Features

Hybrid Tokenization Strategy
Text is first tokenized into words with MeCab using the Unidic dictionary, then split into individual characters, balancing word-boundary information with fine-grained processing (see the tokenization sketch after this list)
Whole Word Masking Training
All subword tokens of the same word are masked simultaneously, enhancing the model's understanding of complete words
Large-scale Pretraining
Trained for 1 million steps on a 4.0 GB Japanese Wikipedia corpus (about 30 million sentences)
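
The following is a minimal sketch of how the hybrid tokenization can be inspected with Hugging Face Transformers. The model ID tohoku-nlp/bert-large-japanese-char and the fugashi/unidic-lite dependencies are assumptions not stated on this page.

```python
# Sketch: inspect the character-level tokenization of this model.
# Assumes the Hugging Face model ID "tohoku-nlp/bert-large-japanese-char"
# and that fugashi + unidic-lite are installed for the MeCab word-level step.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tohoku-nlp/bert-large-japanese-char")

text = "東北大学で自然言語処理の研究をしています。"
tokens = tokenizer.tokenize(text)

# Words are first segmented with MeCab, then split into characters,
# so the output is one token per character, e.g. ['東', '北', '大', '学', ...]
print(tokens)
```
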

Model Capabilities

Japanese Text Understanding
Masked Word Prediction
Contextual Representation Learning

Use Cases

Natural Language Processing
Text Infilling
Predict masked words in text, e.g. 'Engaging in [MASK] research at Tohoku University' (see the fill-mask sketch after this list)
Downstream Task Fine-tuning
Can serve as a baseline model for NLP tasks such as text classification and named entity recognition (a classification sketch follows below)
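
A minimal sketch of masked word prediction with this checkpoint, assuming the Hugging Face model ID tohoku-nlp/bert-large-japanese-char; the example sentence is the Japanese form of the prompt above.

```python
# Sketch: masked word prediction with the fill-mask pipeline.
# Because tokenization is character-level, each [MASK] corresponds to a
# single character in the prediction.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="tohoku-nlp/bert-large-japanese-char")

# "Engaging in [MASK] research at Tohoku University"
for prediction in fill_mask("東北大学で[MASK]の研究をしています。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```
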
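For downstream fine-tuning, the checkpoint can be loaded with a task head. The sketch below shows text classification; the model ID, label count, and example sentences are illustrative assumptions, and a real run would add a training loop or Trainer.

```python
# Sketch: load the checkpoint as a baseline for text classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "tohoku-nlp/bert-large-japanese-char"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode a small batch of Japanese sentences and run a forward pass;
# fine-tuning would wrap this in a standard training loop.
batch = tokenizer(
    ["この映画は面白かった。", "退屈な映画だった。"],
    padding=True,
    return_tensors="pt",
)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```
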