🚀 BERT base Japanese model
This repository provides a BERT base model trained on the Japanese Wikipedia dataset. It can be used for the fill-mask task in Japanese natural language processing.
🚀 Quick Start
First, install the necessary dependencies:
$ pip install torch==1.8.0 transformers==4.8.2 sentencepiece==0.1.95
Then, use transformers.pipeline to perform the fill-mask task:
>>> import transformers
>>> pipeline = transformers.pipeline("fill-mask", "colorfulscoop/bert-base-ja", revision="v1.0")
>>> pipeline("専門として[MASK]を専攻しています")
[{'sequence': '専門として工学を専攻しています', 'score': 0.03630176931619644, 'token': 3988, 'token_str': '工学'}, {'sequence': '専門として政治学を専攻しています', 'score': 0.03547220677137375, 'token': 22307, 'token_str': '政治学'}, {'sequence': '専門として教育を専攻しています', 'score': 0.03162326663732529, 'token': 414, 'token_str': '教育'}, {'sequence': '専門として経済学を専攻しています', 'score': 0.026036914438009262, 'token': 6814, 'token_str': '経済学'}, {'sequence': '専門として法学を専攻しています', 'score': 0.02561848610639572, 'token': 10810, 'token_str': '法学'}]
⚠️ Important Note
It is recommended to specify a revision option to ensure reproducibility when downloading the model via transformers.pipeline or transformers.AutoModel.from_pretrained.
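For example, with the package versions above, both APIs accept a revision argument (v1.0 is the tag used in Quick Start); a minimal sketch:

import transformers

# Pin the revision so later updates to the model repository do not change results.
model = transformers.AutoModel.from_pretrained("colorfulscoop/bert-base-ja", revision="v1.0")
tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/bert-base-ja", revision="v1.0")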
✨ Features
- Trained on the Japanese Wikipedia dataset, which provides rich Japanese language knowledge.
- Uses a custom vocabulary size of 32,000 for better adaptation to the Japanese language.
- Employs transformers.DebertaV2Tokenizer to avoid inconsistent tokenization behavior.
📦 Installation
Install the required dependencies using the following command:
$ pip install torch==1.8.0 transformers==4.8.2 sentencepiece==0.1.95
💻 Usage Examples
Basic Usage
import transformers
pipeline = transformers.pipeline("fill-mask", "colorfulscoop/bert-base-ja", revision="v1.0")
result = pipeline("専門として[MASK]を専攻しています")
print(result)
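Advanced Usage
The same prediction can be computed without the pipeline by loading the model and tokenizer directly. This is a minimal sketch; the variable names are illustrative, and the top-5 selection mirrors the pipeline's default output:

import torch
import transformers

name = "colorfulscoop/bert-base-ja"
tokenizer = transformers.AutoTokenizer.from_pretrained(name, revision="v1.0")
model = transformers.AutoModelForMaskedLM.from_pretrained(name, revision="v1.0")
model.eval()

inputs = tokenizer("専門として[MASK]を専攻しています", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and print the 5 highest-scoring candidate tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
scores, token_ids = logits[0, mask_pos].softmax(dim=-1).topk(5)
for score, token_id in zip(scores[0], token_ids[0]):
    print(tokenizer.decode([int(token_id)]), float(score))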
📚 Documentation
Model description
The model architecture is similar to the BERT base model (hidden_size: 768, num_hidden_layers: 12, num_attention_heads: 12, max_position_embeddings: 512), but with a vocabulary size of 32,000 instead of the original 30,522. transformers.BertForPreTraining is used for the model.
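As a rough sketch, this configuration corresponds to something like the following (the config files shipped with the repository are authoritative):

import transformers

# Sizes listed in the model description; vocab_size is the custom 32,000.
config = transformers.BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = transformers.BertForPreTraining(config)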
Tokenizer description
A SentencePiece tokenizer is used. It was trained on 1,000,000 samples from the train split with a vocabulary size of 32,000. The add_dummy_prefix option is set to True because Japanese text is not separated by whitespace. After training, the SentencePiece model is imported into transformers.DebertaV2Tokenizer to ensure consistent tokenization behavior.
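A minimal sketch of this tokenizer setup, assuming illustrative file names (train.txt, sp.model) and default sampling options rather than the ones used in the actual training pipeline:

import sentencepiece as spm
import transformers

# Train a SentencePiece model with the settings described above;
# input_sentence_size samples 1,000,000 sentences from the train split.
spm.SentencePieceTrainer.train(
    input="train.txt",            # assumed path to the raw train split
    model_prefix="sp",            # writes sp.model and sp.vocab
    vocab_size=32000,
    input_sentence_size=1000000,
    shuffle_input_sentence=True,
    add_dummy_prefix=True,        # Japanese has no whitespace word separation
)

# Import the trained SentencePiece model into DebertaV2Tokenizer
# so tokenization behavior stays consistent inside transformers.
tokenizer = transformers.DebertaV2Tokenizer(vocab_file="sp.model")
print(tokenizer.tokenize("専門として工学を専攻しています"))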
Training data
The Japanese Wikipedia dataset as of June 20, 2021, released under Creative Commons Attribution-ShareAlike 3.0, is used for training. The dataset is split into train, valid, and test subsets.
Training
Training details (a rough sketch of these settings in code follows the list):
- Gradient update is every 256 samples (batch size: 8, accumulate_grad_batches: 32).
- Gradient clip norm is 1.0.
- Learning rate starts from 0 and linearly increases to 0.0001 in the first 10,000 steps.
- The training set has around 20M samples, and 1 epoch has around 80k steps.
- Training was done on Ubuntu 18.04.5 LTS with one RTX 2080 Ti.
- Training continued until the validation loss worsened, with around 214k training steps in total. The test set loss was 2.80.
- The training code is available in a GitHub repository.
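Taken together, each optimizer update covers 8 × 32 = 256 samples, gradients are clipped to norm 1.0, and the learning rate warms up linearly from 0 to 0.0001 over the first 10,000 steps. A minimal sketch of these settings in a plain PyTorch loop; the optimizer choice and the schedule after warm-up are not specified in this card and are assumptions here, and tiny model sizes are used so the snippet runs quickly:

import torch
import transformers

# Tiny sizes for illustration only; the real model uses the BERT base sizes listed above.
config = transformers.BertConfig(
    vocab_size=32000, hidden_size=64, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=128,
)
model = transformers.BertForPreTraining(config)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption
# Linear warm-up from 0 to the base learning rate over the first 10,000 steps;
# the behavior after warm-up is not described in this card (kept constant here).
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / 10_000))

batch_size = 8
accumulate_grad_batches = 32  # 8 * 32 = 256 samples per gradient update

def training_step(batch, batch_idx):
    loss = model(**batch).loss
    (loss / accumulate_grad_batches).backward()
    if (batch_idx + 1) % accumulate_grad_batches == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip norm 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Dummy batch purely so the sketch is executable; real batches come from the Wikipedia data.
dummy = {
    "input_ids": torch.randint(0, config.vocab_size, (batch_size, 128)),
    "labels": torch.randint(0, config.vocab_size, (batch_size, 128)),
    "next_sentence_label": torch.randint(0, 2, (batch_size,)),
}
for step in range(accumulate_grad_batches):
    training_step(dummy, step)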
Usage
After installation, use transformers.pipeline to perform the fill-mask task.
License
All models in this repository are licensed under Creative Commons Attribution-ShareAlike 3.0.
| Property | Details |
|----------|---------|
| Model Type | BERT base model with a custom vocabulary size (32,000) |
| Training Data | Japanese Wikipedia dataset as of June 20, 2021, under CC BY-SA 3.0 |
🔧 Technical Details
The model uses the BERT base architecture with the vocabulary size enlarged to 32,000. Tokenization relies on a SentencePiece model imported into transformers.DebertaV2Tokenizer, chosen because Japanese text is not separated by whitespace. Training used an effective batch size of 256 samples (batch size 8 with 32 gradient-accumulation steps), gradient clipping at norm 1.0, and a linear learning-rate warm-up to 0.0001 over the first 10,000 steps, and was run on Ubuntu 18.04.5 LTS with a single RTX 2080 Ti.
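These figures are mutually consistent; a quick check using the numbers from the Training section:

samples_per_step = 8 * 32                    # batch size x accumulate_grad_batches
steps_per_epoch = 20_000_000 / samples_per_step
print(samples_per_step)                      # 256
print(round(steps_per_epoch))                # 78125, i.e. roughly 80k steps per epoch
print(round(214_000 / steps_per_epoch, 1))   # ~2.7, so training ran for roughly 2-3 epochs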
📄 License
Copyright (c) 2021 Colorful Scoop. All the models in this repository are licensed under Creative Commons Attribution-ShareAlike 3.0.
Disclaimer: The model may generate texts similar to the training data, untrue texts, or biased texts. Use of the model is at your own risk. Colorful Scoop makes no warranty or guarantee for the model's outputs and is not liable for any issues arising from the model output.