🚀 nlp-waseda/roberta-base-japanese-with-auto-jumanpp
This is a Japanese RoBERTa base model pretrained on Japanese Wikipedia and the Japanese portion of CC-100. It can be used for masked language modeling and fine-tuned on downstream tasks.
🚀 Quick Start
You can use this model for masked language modeling as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer (automatic Juman++ word segmentation) and the pretrained model.
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
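The snippet stops after encoding the sentence. A minimal sketch of one way to finish it, assuming you want the top candidate tokens for the [MASK] position (not part of the original card):

```python
import torch

with torch.no_grad():
    logits = model(**encoding).logits

# Locate the [MASK] position and decode the five most likely replacement tokens.
mask_index = (encoding['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```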
You can fine-tune this model on downstream tasks.
✨ Features
- Masked Language Modeling: Can be used for masked language modeling tasks.
- Downstream Task Fine-Tuning: Allows fine-tuning on various downstream tasks.
📦 Installation
This model runs with the Hugging Face transformers library and sentencepiece (for example, `pip install transformers sentencepiece`). The automatic Juman++ tokenization additionally requires a working Juman++ installation and a Python binding such as rhoknp or pyknp; see the Tokenization section below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
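As an alternative to calling the model directly, the fill-mask pipeline wraps the same steps; this is a sketch under the assumption that the pipeline's default mask handling works with this tokenizer:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fill_mask('早稲田大学で自然言語処理を[MASK]する。'))  # top candidates with scores
```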
Advanced Usage
```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
optimizer = AdamW(model.parameters(), lr=1e-4)

# num_epochs and train_dataloader are assumed to be defined elsewhere;
# each batch is expected to provide raw text and matching MLM labels.
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
        labels = batch['labels']
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
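The loop above assumes the dataloader already supplies masked-LM labels. In practice these labels are often generated on the fly with DataCollatorForLanguageModeling; the following is a minimal sketch of how train_dataloader could be built that way (the toy corpus, batch size, and masking probability are placeholder assumptions), after which the loop body reduces to model(**batch):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

texts = ['早稲田大学で自然言語処理を研究する。']  # placeholder corpus
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# The collator pads each batch, selects a random 15% of tokens for masking,
# and fills `labels` with the original ids (-100 elsewhere).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
train_dataloader = DataLoader(encodings, batch_size=8, collate_fn=collator)

for batch in train_dataloader:
    outputs = model(**batch)  # batch carries input_ids, attention_mask, and labels
    print(outputs.loss)
```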
📚 Documentation
Tokenization
`BertJapaneseTokenizer` now supports automatic tokenization with Juman++. However, if your dataset is large, tokenization may take a long time, since `BertJapaneseTokenizer` does not yet support fast tokenization. You can also run the Juman++ tokenization yourself and use the older model nlp-waseda/roberta-base-japanese.
Juman++ 2.0.0-rc3 was used for pretraining. Each word is further split into subword tokens by sentencepiece.
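For reference, here is a rough sketch of doing the Juman++ segmentation yourself with the pyknp binding and feeding whitespace-separated words to the older nlp-waseda/roberta-base-japanese model; the binding choice and loading code are assumptions, not taken from the original card:

```python
# Requires a working Juman++ installation and the pyknp package.
from pyknp import Juman
from transformers import AutoTokenizer, AutoModelForMaskedLM

jumanpp = Juman()  # wraps the jumanpp command
old_tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
old_model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese")

text = '早稲田大学で自然言語処理を研究する。'
words = [mrph.midasi for mrph in jumanpp.analysis(text).mrph_list()]  # surface forms
encoding = old_tokenizer(' '.join(words), return_tensors='pt')
```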
Vocabulary
The vocabulary consists of 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.
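To see this concretely, you can inspect the tokenizer loaded in the usage examples above (a small sketch; the example sentence is arbitrary):

```python
print(tokenizer.vocab_size)  # expected to match the 32000 tokens described above
print(tokenizer.tokenize('早稲田大学で自然言語処理を研究する。'))  # Juman++ words split into sentencepiece subwords
```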
Training procedure
This model was trained on Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100. It took a week using eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining:
- learning_rate: 1e-4
- per_device_train_batch_size: 256
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 4096
- max_seq_length: 128
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- training_steps: 700000
- warmup_steps: 10000
- mixed_precision_training: Native AMP
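These settings are mutually consistent: 256 examples per device × 8 devices × 2 gradient-accumulation steps gives the total train batch size of 4096.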
Performance on JGLUE
See the Baseline Scores of JGLUE.
🔧 Technical Details
This model is based on the RoBERTa architecture, a robustly optimized BERT-like model. It was pretrained on large-scale Japanese corpora (Japanese Wikipedia and the Japanese portion of CC-100) with the masked language modeling objective. Juman++ word segmentation helps handle Japanese text, which has no whitespace between words, and sentencepiece is used for further subword tokenization. The hyperparameters were tuned for masked language modeling pretraining and can be further adjusted for downstream tasks.
📄 License
This model is licensed under the CC BY-SA 4.0 license.
| Property | Details |
|----------|---------|
| Model Type | Japanese RoBERTa base model |
| Training Data | Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100 |