
BERT Large Japanese

Developed by tohoku-nlp
A BERT large model pretrained on Japanese Wikipedia, using Unidic dictionary-based word segmentation and a whole-word masking strategy
Downloads: 1,272
Release Date: 3/2/2022

Model Overview

This is a BERT model optimized for Japanese text, suitable for various natural language processing tasks such as text classification, named entity recognition, and question answering systems.
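As a quick start, the snippet below is a minimal sketch using the Hugging Face Transformers fill-mask pipeline. The Hub ID cl-tohoku/bert-large-japanese and the fugashi / unidic-lite packages required by the MeCab-based tokenizer are assumptions based on the tohoku-nlp release, not details stated on this page.

```python
# Minimal quick-start sketch with Hugging Face Transformers.
# Assumed Hub ID: "cl-tohoku/bert-large-japanese"; the tokenizer also needs
# the fugashi and unidic-lite packages installed (assumption, see lead-in).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cl-tohoku/bert-large-japanese")

# Print the top three predictions for the masked token.
for candidate in fill_mask("東北大学で自然言語処理を[MASK]しています。")[:3]:
    print(candidate["token_str"], candidate["score"])
```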

Model Features

Whole Word Masking Strategy
During pretraining, all subword tokens belonging to a word segmented by MeCab are masked together rather than individually, which improves the model's understanding of whole words
Unidic Dictionary Tokenization
Input text is first segmented into words by MeCab with the Unidic 2.1.2 dictionary and then split into subwords by WordPiece (see the tokenization sketch after this list)
Large-scale Pretraining Data
Based on the Japanese Wikipedia version from August 31, 2020, containing approximately 30 million sentences
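The sketch below illustrates the two-stage tokenization with the BertJapaneseTokenizer class from Transformers; the Hub ID is the same assumption as in the quick-start example above.

```python
# Two-stage tokenization sketch: MeCab word segmentation (Unidic dictionary)
# followed by WordPiece subword splitting.
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-large-japanese")  # assumed Hub ID

# Words are segmented first; subword continuations carry a "##" prefix.
print(tokenizer.tokenize("自然言語処理はとても面白いです。"))
```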

Model Capabilities

Japanese text understanding
Masked language modeling
Text feature extraction (see the sketch after this list)
Downstream NLP task fine-tuning
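For the feature-extraction capability, the sketch below encodes a sentence and mean-pools the last hidden states into a single 1024-dimensional vector (the BERT-large hidden size). The pooling choice and the Hub ID are illustrative assumptions, not prescribed by this page.

```python
# Feature-extraction sketch: encode a Japanese sentence into a fixed-size vector.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cl-tohoku/bert-large-japanese"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("日本語の文をベクトルに変換します。", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 1024)

# Mean-pool token embeddings into one sentence vector (illustrative choice).
sentence_vector = hidden.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 1024])
```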

Use Cases

Natural Language Processing
Text Classification
Performing classification tasks on Japanese text (see the fine-tuning sketch below)
Named Entity Recognition
Identifying proper nouns and entities in Japanese text
Question Answering Systems
Building Japanese question answering systems
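As an example of the text-classification use case, the sketch below fine-tunes the model with the Transformers Trainer. The two-sentence dataset, label count, and hyperparameters are placeholders; replace them with a real labeled corpus for actual use.

```python
# Fine-tuning sketch for Japanese text classification (placeholder data).
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "cl-tohoku/bert-large-japanese"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder sentiment data: 1 = positive, 0 = negative.
train_data = Dataset.from_dict({
    "text": ["この映画は最高だった。", "二度と見たくない作品だ。"],
    "label": [1, 0],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-large-japanese-cls", num_train_epochs=1),
    train_dataset=train_data,
)
trainer.train()
```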