🚀 japanese-roberta-base
This repository provides a base-sized Japanese RoBERTa model. It was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd. The model is intended as a general-purpose pretrained model for Japanese natural language processing tasks.
🚀 Quick Start
Loading the Model
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the SentencePiece-based (slow) tokenizer and enable lowercasing
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-roberta-base", use_fast=False)
tokenizer.do_lower_case = True

# load the pretrained masked language model
model = AutoModelForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
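As a quick sanity check (not part of the original card), you can print a few standard attributes of the loaded objects and confirm they match the description below:

print(tokenizer.mask_token)  # the mask token, "[MASK]"
print(model.config.num_hidden_layers, model.config.hidden_size)  # 12 layers, 768 hidden size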
Using the Model for Masked Token Prediction
⚠️ Important Note
- Use [CLS]: To predict a masked token, prepend a [CLS] token to the sentence, as was done during model training.
- Use [MASK] after tokenization: Typing [MASK] directly in the input string and replacing a token with [MASK] after tokenization yield different results. Using [MASK] after tokenization is preferable, but the Hugging Face Inference API only supports typing [MASK] in the input string; see the comparison sketch after this list.
- Provide position_ids explicitly: When position_ids are not provided for a Roberta* model, Hugging Face's transformers constructs them starting from padding_idx instead of 0, which does not work with rinna/japanese-roberta-base. Construct position_ids starting from 0 manually.
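To make the second note concrete, here is a minimal comparison sketch (reusing the tokenizer loaded above); the exact token sequences depend on the tokenizer, so print and inspect both rather than assuming they match:

# approach 1: type [MASK] directly in the raw string
typed = tokenizer.tokenize("[CLS]4年に1度[MASK]は開かれる。")

# approach 2: tokenize the full sentence first, then replace the target token with the mask token
replaced = tokenizer.tokenize("[CLS]4年に1度オリンピックは開かれる。")
replaced[5] = tokenizer.mask_token

# the two sequences are not guaranteed to be identical
print(typed)
print(replaced)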
💡 Usage Tip
When using the model for masked token prediction, carefully follow the above notes to get accurate results.
Example
text = "4年に1度オリンピックは開かれる。"

# prepend a [CLS] token, as was done during training
text = "[CLS]" + text
tokens = tokenizer.tokenize(text)
print(tokens)

# replace the target token with [MASK] after tokenization
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)

# convert the tokens to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
import torch

# build the input tensor and construct position ids starting from 0
token_tensor = torch.LongTensor([token_ids])
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)
position_id_tensor = torch.LongTensor([position_ids])

# run the model and take the top-10 predictions at the masked position
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)
"""
0 総会
1 サミット
2 ワールドカップ
3 フェスティバル
4 大会
5 オリンピック
6 全国大会
7 党大会
8 イベント
9 世界選手権
"""
✨ Features
- Transformer-based: A 12-layer, 768-hidden-size transformer-based masked language model.
- Trained on large datasets: Trained on Japanese CC-100 and Japanese Wikipedia to optimize a masked language modelling objective.
📚 Documentation
Model architecture
A 12-layer, 768-hidden-size transformer-based masked language model.
Training
The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a masked language modelling objective on 8*V100 GPUs for around 15 days. It reaches ~3.9 perplexity on a dev set sampled from CC-100.
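The reported ~3.9 perplexity comes from rinna's own evaluation. The following is only a rough, hypothetical sketch of how a masked-LM (pseudo-)perplexity estimate can be computed with this model by masking one position at a time; it reuses the tokenizer and model loaded in the Quick Start and is not the original evaluation script:

import torch

def pseudo_perplexity(sentence):
    # prepend [CLS] and tokenize, as in the usage example above
    tokens = tokenizer.tokenize("[CLS]" + sentence)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    losses = []
    for i in range(1, len(ids)):  # skip the [CLS] position
        masked = list(ids)
        masked[i] = tokenizer.mask_token_id
        input_ids = torch.LongTensor([masked])
        position_ids = torch.LongTensor([list(range(input_ids.size(1)))])  # start from 0
        with torch.no_grad():
            logits = model(input_ids=input_ids, position_ids=position_ids).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        losses.append(-log_probs[ids[i]].item())
    # exponentiate the mean negative log-likelihood
    return float(torch.exp(torch.tensor(sum(losses) / len(losses))))

print(pseudo_perplexity("4年に1度オリンピックは開かれる。"))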
Tokenization
The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official SentencePiece training script.
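The card does not list the exact SentencePiece training command. As an illustration only, training a vocabulary with the SentencePiece Python API might look like the following, where the input file name and vocabulary size are hypothetical:

import sentencepiece as spm

# hypothetical corpus file (one Japanese Wikipedia sentence per line) and vocabulary size
spm.SentencePieceTrainer.train(
    input="jawiki_sentences.txt",
    model_prefix="spiece",
    vocab_size=32000,
)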
Release date
August 25, 2021
How to cite
@misc{rinna-japanese-roberta-base,
title = {rinna/japanese-roberta-base},
author = {Zhao, Tianyu and Sawada, Kei},
url = {https://huggingface.co/rinna/japanese-roberta-base}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
📄 License
The MIT license
📦 Additional Information
| Property | Details |
|----------|---------|
| Model Type | A 12-layer, 768-hidden-size transformer-based masked language model |
| Training Data | Japanese CC-100 and Japanese Wikipedia |
| Mask Token | [MASK] |
| Tags | roberta, masked-lm, nlp |
| Thumbnail | rinna-icon |
| Widget Example | [CLS]4年に1度[MASK]は開かれる。 |