🚀 nlp-waseda/bigbird-base-japanese
This is a Japanese BigBird base model. It is pretrained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR, offering strong capabilities for natural language processing tasks in Japanese.
🚀 Quick Start
You can use this model for masked language modeling as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/bigbird-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/bigbird-base-japanese")

# The input should be segmented into words by Juman++ in advance (see Tokenization below).
sentence = '[MASK] 大学 で 自然 言語 処理 を 学ぶ 。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
You can fine-tune this model on downstream tasks.
✨ Features
- Pretrained on Multiple Datasets: Trained on Japanese Wikipedia, the Japanese part of CC-100, and the Japanese part of OSCAR, providing rich language knowledge.
- Long-sequence Handling: Based on the BigBird architecture, it can handle long sequences effectively.
- Fine-tunable: Can be fine-tuned on various downstream tasks.
📦 Installation
The model is hosted on the Hugging Face Hub and loads with the transformers library; the sentencepiece package is typically also required for the tokenizer, and Juman++ is needed to segment input text (see Tokenization below).
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/bigbird-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/bigbird-base-japanese")

sentence = '[MASK] 大学 で 自然 言語 処理 を 学ぶ 。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
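The ellipsis above is left unfilled in the original card. As a minimal sketch, assuming the tokenizer exposes the usual `mask_token_id`, the top prediction for the `[MASK]` slot can be read out like this:

```python
import torch

# Run the model and pick the most likely token at the [MASK] position.
with torch.no_grad():
    logits = model(**encoding).logits

mask_index = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```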
Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/bigbird-base-japanese")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/bigbird-base-japanese")

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50
)

# train_dataset and eval_dataset are placeholders: they must be tokenized
# datasets prepared by the user (see the sketch below).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
```
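The `train_dataset` and `eval_dataset` above are not defined in the original card. A hedged sketch of one way to build them for masked-LM fine-tuning, assuming hypothetical text files (`train.txt`, `dev.txt`) of Juman++-segmented sentences and the `datasets` library:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# Load plain-text corpora; the file names here are placeholders.
raw = load_dataset("text", data_files={"train": "train.txt", "validation": "dev.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
train_dataset = tokenized["train"]
eval_dataset = tokenized["validation"]

# Randomly masks tokens on the fly for the masked-LM objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```

Note that without a collator that creates masked `labels`, the `Trainer` above has no loss to optimize, so `data_collator=data_collator` would also need to be passed to `Trainer` in practice.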
📚 Documentation
Tokenization
The input text should be segmented into words by Juman++ in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is then split into tokens by sentencepiece.
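For illustration only (not part of the original card), one way to segment raw text with Juman++ is via the pyknp bindings; this assumes the `jumanpp` binary and the `pyknp` package are installed:

```python
from pyknp import Juman

# Segment a raw sentence into Juman++ words, joined by spaces as the model expects.
juman = Juman()  # requires the jumanpp command to be available on PATH
result = juman.analysis("早稲田大学で自然言語処理を学ぶ。")
segmented = " ".join(m.midasi for m in result.mrph_list())
print(segmented)  # expected to look like: 早稲田 大学 で 自然 言語 処理 を 学ぶ 。
```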
Vocabulary
The vocabulary consists of 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.
Training procedure
This model was trained on Japanese Wikipedia (as of 20221101), the Japanese portion of CC-100, and the Japanese portion of OSCAR. Training took two weeks on 16 NVIDIA A100 GPUs, using the transformers library and DeepSpeed.
The following hyperparameters were used during pretraining:
- learning_rate: 1e-4
- per_device_train_batch_size: 6
- gradient_accumulation_steps: 2
- total_train_batch_size: 192
- max_seq_length: 4096
- training_steps: 600000
- warmup_steps: 6000
- bf16: true
- deepspeed: ds_config.json
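The original training scripts and `ds_config.json` are not included in the card. As a rough sketch, most of the hyperparameters above map onto `TrainingArguments` in transformers as follows (the sequence length is handled at tokenization time instead):

```python
from transformers import TrainingArguments

# Approximate mapping of the published pretraining hyperparameters;
# the actual ds_config.json contents are not provided.
pretrain_args = TrainingArguments(
    output_dir="./pretrain",
    learning_rate=1e-4,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=2,   # 6 per device * 2 accumulation * 16 GPUs = 192
    max_steps=600000,
    warmup_steps=6000,
    bf16=True,
    deepspeed="ds_config.json",
)
```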
Performance on JGLUE
We fine-tuned the following models and evaluated them on the dev set of JGLUE.
We tuned the learning rate and the number of training epochs for each model and task, following the JGLUE paper.
For the tasks other than MARC-ja, the maximum input length is short, so attention_type was set to "original_full" for fine-tuning. For MARC-ja, both "block_sparse" and "original_full" were used.
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|---|---|---|---|---|---|---|---|
| Waseda RoBERTa base | 0.965 | 0.913 | 0.876 | 0.905 | 0.853 | 0.916 | 0.853 |
| Waseda RoBERTa large (seq512) | 0.969 | 0.925 | 0.890 | 0.928 | 0.910 | 0.955 | 0.900 |
| BigBird base (original_full) | 0.959 | 0.888 | 0.846 | 0.896 | 0.884 | 0.933 | 0.787 |
| BigBird base (block_sparse) | 0.959 | - | - | - | - | - | - |
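The card does not include the JGLUE fine-tuning scripts. As a hedged sketch, the attention_type mentioned above can be chosen when loading the model, since it is a BigBird configuration attribute:

```python
from transformers import AutoModelForSequenceClassification

# Full (quadratic) attention for the short-sequence tasks (JSTS, JNLI, etc.).
model_full = AutoModelForSequenceClassification.from_pretrained(
    "nlp-waseda/bigbird-base-japanese", attention_type="original_full"
)

# Sparse attention for long inputs such as MARC-ja reviews.
model_sparse = AutoModelForSequenceClassification.from_pretrained(
    "nlp-waseda/bigbird-base-japanese", attention_type="block_sparse"
)
```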
🔧 Technical Details
The model is based on the BigBird architecture, which is designed to handle long sequences efficiently. It uses a sparse attention mechanism to reduce the computational complexity of standard full self-attention. Pretraining leveraged multiple large-scale Japanese datasets and was run with the transformers library and DeepSpeed.
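As a small illustration (not from the original card), the sparse-attention settings can be inspected directly on the pretrained configuration:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nlp-waseda/bigbird-base-japanese")
print(config.attention_type)     # which attention pattern is active by default
print(config.block_size)         # token block size used by the sparse pattern
print(config.num_random_blocks)  # random blocks each query block attends to
```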
📄 License
The model is licensed under CC BY-SA 4.0.
Acknowledgments
This work was supported by AI Bridging Cloud Infrastructure (ABCI) through the "Construction of a Japanese Large-Scale General-Purpose Language Model that Handles Long Sequences" at the 3rd ABCI Grand Challenge 2022.