RoBERTa Base Japanese
A Japanese RoBERTa-based pretrained model, trained on Japanese Wikipedia and the Japanese portion of CC-100.
Downloads: 456
Release date: 3/2/2022
Model Overview
This is a pretrained Japanese language model based on the RoBERTa architecture, intended primarily for masked language modeling. Trained on large-scale Japanese corpora, it can also be fine-tuned for a wide range of Japanese natural language processing tasks.
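A minimal loading sketch with the Hugging Face transformers library is shown below; the repository identifier is an assumption inferred from the training details on this page, not something the page states.

```python
# Minimal loading sketch; the Hub identifier is an assumption, not stated on this page.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "nlp-waseda/roberta-base-japanese"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
```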
Model Features
Japanese-Specific Pretraining
Specifically pretrained for Japanese, using Japanese Wikipedia and the Japanese portion of CC-100 as training data
Juman++ Tokenization Support
Input text must be segmented into words with Juman++ before it is passed to the model's tokenizer; a pre-tokenization sketch follows this feature list
Large Vocabulary
The 32,000-token vocabulary combines words from JumanDIC with subwords induced by SentencePiece
Efficient Training
Trained for one week using 8 NVIDIA A100 GPUs with various optimization techniques
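Because input must be segmented with Juman++ in advance, a pre-tokenization step precedes every call to the tokenizer. The sketch below assumes Juman++ is installed locally and is accessed through the pyknp Python binding; neither the binding nor the example sentence comes from this page.

```python
# Sketch of whitespace pre-segmentation with Juman++ via the pyknp binding (assumed setup).
from pyknp import Juman

jumanpp = Juman()  # requires a local Juman++ installation on PATH
text = "早稲田大学で自然言語処理を専攻する。"  # example sentence, not from this page
morphemes = jumanpp.analysis(text).mrph_list()
segmented = " ".join(m.midasi for m in morphemes)
print(segmented)  # whitespace-separated surface forms, ready for the model's tokenizer
```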
Model Capabilities
Japanese Text Understanding
Masked Language Prediction
Downstream Task Fine-Tuning
Use Cases
Natural Language Processing
Text Completion
Predicts words replaced by the [MASK] token in sentences
Accurately predicts missing words in Japanese text
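A sketch of [MASK] prediction with the transformers fill-mask pipeline; the repository identifier and the example sentence are assumptions, and the input must already be segmented with Juman++ as described above.

```python
# Sketch of [MASK] prediction; the repository id and example sentence are assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlp-waseda/roberta-base-japanese")  # assumed id
# The input is segmented with Juman++ beforehand; one word is replaced by [MASK].
text = "早稲田 大学 で 自然 言語 処理 を [MASK] する 。"
for candidate in fill_mask(text):
    print(candidate["token_str"], round(candidate["score"], 3))
```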
Text Classification
Can be fine-tuned for tasks like sentiment analysis and topic classification
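A minimal fine-tuning sketch for sentence classification; the repository identifier, label count, and example text are placeholders, and texts must be Juman++-segmented as above.

```python
# Sketch: attach a classification head for fine-tuning; id, labels, and text are placeholders.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlp-waseda/roberta-base-japanese"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Texts must be segmented with Juman++ before tokenization.
batch = tokenizer(["この 映画 は 面白い 。"], return_tensors="pt", padding=True)
logits = model(**batch).logits  # shape (1, 2), ready for a standard training loop
```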
Named Entity Recognition
Can be fine-tuned to identify entities such as person names and locations in Japanese text
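Similarly, a token-classification head can be attached for NER fine-tuning; the label set below is a placeholder, not part of this page.

```python
# Sketch: token-classification head for NER fine-tuning; id and label count are placeholders.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "nlp-waseda/roberta-base-japanese",  # assumed repository id
    num_labels=5,                        # e.g. O, B-PER, I-PER, B-LOC, I-LOC
)
```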