🚀 nlp-waseda/roberta-large-japanese-with-auto-jumanpp
This project provides a large Japanese RoBERTa model pretrained on Japanese Wikipedia and the Japanese portion of CC-100, offering strong natural language processing capabilities for Japanese text.
🚀 Quick Start
You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
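Equivalently, the fill-mask pipeline can run the same example end to end. This is a minimal sketch assuming the standard transformers pipeline API; it is not part of the original card:

```python
from transformers import pipeline

# Sketch: fill-mask pipeline with the same checkpoint (illustrative, not from the card).
fill_mask = pipeline(
    "fill-mask",
    model="nlp-waseda/roberta-large-japanese-with-auto-jumanpp",
)
for candidate in fill_mask('早稲田大学で自然言語処理を[MASK]する。'):
    print(candidate["token_str"], candidate["score"])
```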
You can also fine-tune this model on downstream tasks (see the sketch under Advanced Usage below).
✨ Features
- Pretrained on large-scale data: Trained on Japanese Wikipedia and the Japanese portion of CC-100, providing rich language knowledge.
- Support for Juman++ tokenization: `BertJapaneseTokenizer` supports automatic Juman++ tokenization (see the snippet after this list).
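As a quick, illustrative check of the automatic word segmentation (this snippet is a sketch and not part of the original card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
# The tokenizer first segments the sentence into words with Juman++,
# then splits each word into subwords with the sentencepiece model.
print(tokenizer.tokenize('早稲田大学で自然言語処理を研究する。'))
```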
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
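The trailing `...` is left open in the original card. One way to continue the snippet above is to decode the model's top prediction for the [MASK] position; the following is only a sketch under that assumption, not the authors' reference code:

```python
import torch

# Run the model and read off the highest-scoring token at the [MASK] position.
with torch.no_grad():
    logits = model(**encoding).logits
mask_index = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```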
Advanced Usage
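The card does not include an advanced example, so the following is only a sketch of how fine-tuning for sequence classification might look with the Trainer API. The toy texts, labels, output directory, and hyperparameters are placeholders:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "nlp-waseda/roberta-large-japanese-with-auto-jumanpp"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

class ToyDataset(Dataset):
    """Tiny in-memory dataset; replace with a real downstream task."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder sentences: "It was a very good movie." / "It was a boring movie."
train_dataset = ToyDataset(['とても良い映画だった。', '退屈な映画だった。'], [1, 0])

args = TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```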
📚 Documentation
Tokenization
`BertJapaneseTokenizer` now supports automatic tokenization for Juman++. However, if your dataset is large, tokenization may take a long time, since `BertJapaneseTokenizer` does not yet support fast tokenization. You can still run the Juman++ tokenization yourself and use the old model nlp-waseda/roberta-large-japanese.
Juman++ 2.0.0-rc3 was used for pretraining. Each word is then split into subword tokens by sentencepiece.
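For manual pre-tokenization (for example, to use the old model mentioned above), a sketch using the pyknp wrapper around Juman++ might look like the following; pyknp and a local Juman++ installation are assumptions on my part and are not part of this card:

```python
from pyknp import Juman  # assumes pyknp and a local Juman++ installation
from transformers import AutoTokenizer

jumanpp = Juman()  # pyknp drives the jumanpp binary
sentence = '早稲田大学で自然言語処理を研究する。'
words = [mrph.midasi for mrph in jumanpp.analysis(sentence).mrph_list()]
segmented = ' '.join(words)  # whitespace-separated words for the old model

old_tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese")
encoding = old_tokenizer(segmented, return_tensors='pt')
```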
Vocabulary
The vocabulary consists of 32,000 tokens, including words (from JumanDIC) and subwords induced by the unigram language model of sentencepiece.
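As an illustrative check (not from the original card), the reported vocabulary size can be read directly from the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
print(tokenizer.vocab_size)  # expected to report the 32,000-token vocabulary
```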
Training procedure
This model was trained on Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100. Training took two weeks using eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining (an illustrative TrainingArguments sketch follows the list):
- learning_rate: 6e-5
- per_device_train_batch_size: 103
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 5
- total_train_batch_size: 4120
- max_seq_length: 128
- optimizer: Adam with betas=(0.9, 0.98) and epsilon=1e-6
- lr_scheduler_type: linear
- training_steps: 670000
- warmup_steps: 10000
- mixed_precision_training: Native AMP
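For illustration, these settings could be expressed roughly as Hugging Face TrainingArguments. The original pretraining script is not part of this card, so treat the mapping below as a sketch (note that 103 per device × 8 GPUs × 5 accumulation steps = 4120, the reported total batch size):

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the authors' actual script.
args = TrainingArguments(
    output_dir="roberta-large-japanese-pretraining",  # placeholder path
    learning_rate=6e-5,
    per_device_train_batch_size=103,  # 103 x 8 GPUs x 5 accumulation steps = 4120
    gradient_accumulation_steps=5,
    max_steps=670_000,
    warmup_steps=10_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    fp16=True,  # Native AMP mixed precision
)
# max_seq_length=128 is applied when tokenizing the corpus, not via TrainingArguments.
```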
Performance on JGLUE
See the Baseline Scores of JGLUE.
🔧 Technical Details
This model is a large Japanese RoBERTa model. It uses Juman++ for word-level tokenization and sentencepiece for subword tokenization. The vocabulary size is 32,000, covering words from JumanDIC and subwords induced by the unigram language model of sentencepiece. Pretraining took two weeks on eight NVIDIA A100 GPUs, using Japanese Wikipedia and the Japanese portion of CC-100.
📄 License
This model is licensed under the CC BY-SA 4.0 license.
| Property | Details |
|----------|---------|
| Model Type | Japanese RoBERTa large model |
| Training Data | Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100 |