🚀 BERT base Japanese model
This repository provides a BERT base model trained on the Japanese Wikipedia dataset. It can be used for the fill-mask task in Japanese natural language processing.
🚀 Quick Start
First, install the necessary dependencies:
$ pip install torch==1.8.0 transformers==4.8.2 sentencepiece==0.1.95
Then, use transformers.pipeline to perform the fill-mask task:
>>> import transformers
>>> pipeline = transformers.pipeline("fill-mask", "colorfulscoop/bert-base-ja", revision="v1.0")
>>> pipeline("専門として[MASK]を専攻しています")
[{'sequence': '専門として工学を専攻しています', 'score': 0.03630176931619644, 'token': 3988, 'token_str': '工学'}, {'sequence': '専門として政治学を専攻しています', 'score': 0.03547220677137375, 'token': 22307, 'token_str': '政治学'}, {'sequence': '専門として教育を専攻しています', 'score': 0.03162326663732529, 'token': 414, 'token_str': '教育'}, {'sequence': '専門として経済学を専攻しています', 'score': 0.026036914438009262, 'token': 6814, 'token_str': '経済学'}, {'sequence': '専門として法学を専攻しています', 'score': 0.02561848610639572, 'token': 10810, 'token_str': '法学'}]
⚠️ Important Note
It is recommended to specify a revision option to ensure reproducibility when downloading the model via transformers.pipeline or transformers.AutoModel.from_pretrained.
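For example, with the package versions above, both APIs accept a revision argument (v1.0 is the tag used in Quick Start); a minimal sketch:

import transformers

# Pin the revision so later updates to the model repository do not change results.
model = transformers.AutoModel.from_pretrained("colorfulscoop/bert-base-ja", revision="v1.0")
tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/bert-base-ja", revision="v1.0")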
✨ Features
- Trained on the Japanese Wikipedia dataset, which provides rich Japanese language knowledge.
- Uses a custom vocabulary size of 32,000 for better adaptation to the Japanese language.
- Employs transformers.DebertaV2Tokenizer to avoid inconsistent tokenization behavior.
📦 Installation
Install the required dependencies using the following command:
$ pip install torch==1.8.0 transformers==4.8.2 sentencepiece==0.1.95
💻 Usage Examples
Basic Usage
import transformers
pipeline = transformers.pipeline("fill-mask", "colorfulscoop/bert-base-ja", revision="v1.0")
result = pipeline("専門として[MASK]を専攻しています")
print(result)
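Advanced Usage
The same prediction can be computed without the pipeline by loading the model and tokenizer directly. This is a minimal sketch; the variable names are illustrative, and the top-5 selection mirrors the pipeline's default output:

import torch
import transformers

name = "colorfulscoop/bert-base-ja"
tokenizer = transformers.AutoTokenizer.from_pretrained(name, revision="v1.0")
model = transformers.AutoModelForMaskedLM.from_pretrained(name, revision="v1.0")
model.eval()

inputs = tokenizer("専門として[MASK]を専攻しています", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and print the 5 highest-scoring candidate tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
scores, token_ids = logits[0, mask_pos].softmax(dim=-1).topk(5)
for score, token_id in zip(scores[0], token_ids[0]):
    print(tokenizer.decode([int(token_id)]), float(score))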
📚 Documentation
Model description
The model architecture is similar to the BERT base model (hidden_size: 768, num_hidden_layers: 12, num_attention_heads: 12, max_position_embeddings: 512), but with a vocabulary size of 32,000 instead of the original 30,522. transformers.BertForPreTraining is used for the model.
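As a rough sketch, this configuration corresponds to something like the following (the config files shipped with the repository are authoritative):

import transformers

# Sizes listed in the model description; vocab_size is the custom 32,000.
config = transformers.BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = transformers.BertForPreTraining(config)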
Tokenizer description
A SentencePiece tokenizer is used. It was trained on 1,000,000 samples from the train split with a vocabulary size of 32,000. The add_dummy_prefix option is set to True because Japanese text is not separated by whitespace. After training, the SentencePiece model is imported into transformers.DebertaV2Tokenizer to ensure consistent tokenization behavior.
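A minimal sketch of this tokenizer setup, assuming illustrative file names (train.txt, sp.model) and default sampling options rather than the ones used in the actual training pipeline:

import sentencepiece as spm
import transformers

# Train a SentencePiece model with the settings described above;
# input_sentence_size samples 1,000,000 sentences from the train split.
spm.SentencePieceTrainer.train(
    input="train.txt",            # assumed path to the raw train split
    model_prefix="sp",            # writes sp.model and sp.vocab
    vocab_size=32000,
    input_sentence_size=1000000,
    shuffle_input_sentence=True,
    add_dummy_prefix=True,        # Japanese has no whitespace word separation
)

# Import the trained SentencePiece model into DebertaV2Tokenizer
# so tokenization behavior stays consistent inside transformers.
tokenizer = transformers.DebertaV2Tokenizer(vocab_file="sp.model")
print(tokenizer.tokenize("専門として工学を専攻しています"))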
Training data
The Japanese Wikipedia dataset as of June 20, 2021, released under Creative Commons Attribution-ShareAlike 3.0, is used for training. The dataset is split into train, valid, and test subsets.
Training
Training details (a rough sketch of these settings in code follows the list):
- Gradient update is every 256 samples (batch size: 8, accumulate_grad_batches: 32).
- Gradient clip norm is 1.0.
- Learning rate starts from 0 and linearly increases to 0.0001 in the first 10,000 steps.
- The training set has around 20M samples, and 1 epoch has around 80k steps.
- Training was done on Ubuntu 18.04.5 LTS with one RTX 2080 Ti.
- Training continued until the validation loss worsened, with around 214k training steps in total. The test set loss was 2.80.
- The training code is available in a GitHub repository.
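Taken together, each optimizer update covers 8 × 32 = 256 samples, gradients are clipped to norm 1.0, and the learning rate warms up linearly from 0 to 0.0001 over the first 10,000 steps. A minimal sketch of these settings in a plain PyTorch loop; the optimizer choice and the schedule after warm-up are not specified in this card and are assumptions here, and tiny model sizes are used so the snippet runs quickly:

import torch
import transformers

# Tiny sizes for illustration only; the real model uses the BERT base sizes listed above.
config = transformers.BertConfig(
    vocab_size=32000, hidden_size=64, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=128,
)
model = transformers.BertForPreTraining(config)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption
# Linear warm-up from 0 to the base learning rate over the first 10,000 steps;
# the behavior after warm-up is not described in this card (kept constant here).
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / 10_000))

batch_size = 8
accumulate_grad_batches = 32  # 8 * 32 = 256 samples per gradient update

def training_step(batch, batch_idx):
    loss = model(**batch).loss
    (loss / accumulate_grad_batches).backward()
    if (batch_idx + 1) % accumulate_grad_batches == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip norm 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Dummy batch purely so the sketch is executable; real batches come from the Wikipedia data.
dummy = {
    "input_ids": torch.randint(0, config.vocab_size, (batch_size, 128)),
    "labels": torch.randint(0, config.vocab_size, (batch_size, 128)),
    "next_sentence_label": torch.randint(0, 2, (batch_size,)),
}
for step in range(accumulate_grad_batches):
    training_step(dummy, step)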
Usage
After installation, use transformers.pipeline to perform the fill-mask task.
License
All models in this repository are licensed under Creative Commons Attribution-ShareAlike 3.0.
| Property | Details |
|----------|---------|
| Model Type | BERT base model with a custom vocabulary size (32,000) |
| Training Data | Japanese Wikipedia dataset as of June 20, 2021, under CC BY-SA 3.0 |
🔧 Technical Details
The model uses the BERT base architecture with the vocabulary size enlarged to 32,000. Tokenization relies on a SentencePiece model imported into transformers.DebertaV2Tokenizer, chosen because Japanese text is not separated by whitespace. Training used an effective batch size of 256 samples (batch size 8 with 32 gradient-accumulation steps), gradient clipping at norm 1.0, and a linear learning-rate warm-up to 0.0001 over the first 10,000 steps, and was run on Ubuntu 18.04.5 LTS with a single RTX 2080 Ti.
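These figures are mutually consistent; a quick check using the numbers from the Training section:

samples_per_step = 8 * 32                    # batch size x accumulate_grad_batches
steps_per_epoch = 20_000_000 / samples_per_step
print(samples_per_step)                      # 256
print(round(steps_per_epoch))                # 78125, i.e. roughly 80k steps per epoch
print(round(214_000 / steps_per_epoch, 1))   # ~2.7, so training ran for roughly 2-3 epochs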
📄 License
Copyright (c) 2021 Colorful Scoop. All the models in this repository are licensed under Creative Commons Attribution-ShareAlike 3.0.
Disclaimer: The model may generate texts similar to the training data, untrue texts, or biased texts. Use of the model is at your own risk. Colorful Scoop makes no warranty or guarantee for the model's outputs and is not liable for any issues arising from the model output.