🚀 japanese-roberta-base
This repository provides a base-sized Japanese RoBERTa model. It was trained using code from the GitHub repository rinnakk/japanese-pretrained-models by rinna Co., Ltd. The model is intended as a general-purpose pretrained model for Japanese natural language processing tasks.
🚀 Quick Start
Loading the Model
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the SentencePiece-based (slow) tokenizer and enable lowercasing
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-roberta-base", use_fast=False)
tokenizer.do_lower_case = True

# load the pretrained masked language model
model = AutoModelForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
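As a quick sanity check (not part of the original card), you can print a few standard attributes of the loaded objects and confirm they match the description below:

print(tokenizer.mask_token)  # the mask token, "[MASK]"
print(model.config.num_hidden_layers, model.config.hidden_size)  # 12 layers, 768 hidden size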
Using the Model for Masked Token Prediction
⚠️ Important Note
- Use [CLS]: To predict a masked token, prepend a [CLS] token to the sentence, as was done during model training.
- Use [MASK] after tokenization: Typing [MASK] directly in the input string and replacing a token with [MASK] after tokenization yield different results. Using [MASK] after tokenization is preferable, but the Hugging Face Inference API only supports typing [MASK] in the input string; see the comparison sketch after this list.
- Provide position_ids explicitly: When position_ids are not provided for a Roberta* model, Hugging Face's transformers constructs them starting from padding_idx instead of 0, which does not work with rinna/japanese-roberta-base. Construct position_ids starting from 0 manually.
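To make the second note concrete, here is a minimal comparison sketch (reusing the tokenizer loaded above); the exact token sequences depend on the tokenizer, so print and inspect both rather than assuming they match:

# approach 1: type [MASK] directly in the raw string
typed = tokenizer.tokenize("[CLS]4年に1度[MASK]は開かれる。")

# approach 2: tokenize the full sentence first, then replace the target token with the mask token
replaced = tokenizer.tokenize("[CLS]4年に1度オリンピックは開かれる。")
replaced[5] = tokenizer.mask_token

# the two sequences are not guaranteed to be identical
print(typed)
print(replaced)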
💡 Usage Tip
When using the model for masked token prediction, carefully follow the above notes to get accurate results.
Example
text = "4年に1度オリンピックは開かれる。"

# prepend a [CLS] token, as was done during training
text = "[CLS]" + text
tokens = tokenizer.tokenize(text)
print(tokens)

# replace the target token with [MASK] after tokenization
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)

# convert the tokens to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
import torch

# build the input tensor and construct position ids starting from 0
token_tensor = torch.LongTensor([token_ids])
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)
position_id_tensor = torch.LongTensor([position_ids])

# run the model and take the top-10 predictions at the masked position
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)
"""
0 総会
1 サミット
2 ワールドカップ
3 フェスティバル
4 大会
5 オリンピック
6 全国大会
7 党大会
8 イベント
9 世界選手権
"""
✨ Features
- Transformer-based: A 12-layer, 768-hidden-size transformer-based masked language model.
- Trained on large datasets: Trained on Japanese CC-100 and Japanese Wikipedia to optimize a masked language modelling objective.
📚 Documentation
Model architecture
A 12-layer, 768-hidden-size transformer-based masked language model.
Training
The model was trained on Japanese CC-100 and Japanese Wikipedia to optimize a masked language modelling objective on 8*V100 GPUs for around 15 days. It reaches ~3.9 perplexity on a dev set sampled from CC-100.
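The reported ~3.9 perplexity comes from rinna's own evaluation. The following is only a rough, hypothetical sketch of how a masked-LM (pseudo-)perplexity estimate can be computed with this model by masking one position at a time; it reuses the tokenizer and model loaded in the Quick Start and is not the original evaluation script:

import torch

def pseudo_perplexity(sentence):
    # prepend [CLS] and tokenize, as in the usage example above
    tokens = tokenizer.tokenize("[CLS]" + sentence)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    losses = []
    for i in range(1, len(ids)):  # skip the [CLS] position
        masked = list(ids)
        masked[i] = tokenizer.mask_token_id
        input_ids = torch.LongTensor([masked])
        position_ids = torch.LongTensor([list(range(input_ids.size(1)))])  # start from 0
        with torch.no_grad():
            logits = model(input_ids=input_ids, position_ids=position_ids).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        losses.append(-log_probs[ids[i]].item())
    # exponentiate the mean negative log-likelihood
    return float(torch.exp(torch.tensor(sum(losses) / len(losses))))

print(pseudo_perplexity("4年に1度オリンピックは開かれる。"))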
Tokenization
The model uses a SentencePiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia using the official SentencePiece training script.
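The card does not list the exact SentencePiece training command. As an illustration only, training a vocabulary with the SentencePiece Python API might look like the following, where the input file name and vocabulary size are hypothetical:

import sentencepiece as spm

# hypothetical corpus file (one Japanese Wikipedia sentence per line) and vocabulary size
spm.SentencePieceTrainer.train(
    input="jawiki_sentences.txt",
    model_prefix="spiece",
    vocab_size=32000,
)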
Release date
August 25, 2021
How to cite
@misc{rinna-japanese-roberta-base,
title = {rinna/japanese-roberta-base},
author = {Zhao, Tianyu and Sawada, Kei},
url = {https://huggingface.co/rinna/japanese-roberta-base}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
📄 License
The MIT license
📦 Additional Information
| Property | Details |
|----------|---------|
| Model Type | A 12-layer, 768-hidden-size transformer-based masked language model |
| Training Data | Japanese CC-100 and Japanese Wikipedia |
| Mask Token | [MASK] |
| Tags | roberta, masked-lm, nlp |
| Thumbnail | rinna-icon |
| Widget Example | [CLS]4年に1度[MASK]は開かれる。 |