RoBERTa Large Japanese
A large Japanese RoBERTa model pretrained on Japanese Wikipedia and the Japanese portion of CC-100, suitable for Japanese natural language processing tasks.
Downloads: 227
Release Time: 5/10/2022
Model Overview
This is a large Japanese RoBERTa model pretrained on Japanese Wikipedia and the Japanese portion of CC-100, primarily used for masked language modeling and for fine-tuning on downstream Japanese NLP tasks.
Model Features
Japanese-specific Pretraining
Pretrained specifically on Japanese text, so it is tuned for Japanese natural language processing tasks.
Juman++ Tokenization Support
Input text must be segmented into words with Juman++ before tokenization, which is the form the model's tokenizer expects for Japanese text (see the sketch after this feature list).
Large-scale Training Data
Trained on Japanese Wikipedia and the Japanese portion of CC-100, covering a broad range of Japanese text.
High-performance Hardware Training
Trained on eight NVIDIA A100 GPUs for two weeks.
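The tokenizer assumes whitespace-separated Juman++ word units, so raw text has to be segmented first. Below is a minimal sketch of that step, assuming the model is published on the Hugging Face Hub as nlp-waseda/roberta-large-japanese (this repository id is an assumption, not stated above) and that Juman++ plus the pyknp Python bindings are installed locally.

```python
# Minimal sketch of the Juman++ pre-tokenization step. Assumptions: the model
# is available on the Hugging Face Hub as "nlp-waseda/roberta-large-japanese"
# (repository id not stated above), and Juman++ with the pyknp bindings is
# installed locally.
from pyknp import Juman
from transformers import AutoTokenizer

jumanpp = Juman()  # wraps a local Juman++ installation


def segment(text: str) -> str:
    # Split raw Japanese text into Juman++ word units joined by spaces,
    # the whitespace-separated form the model's tokenizer expects.
    return " ".join(m.midasi for m in jumanpp.analysis(text).mrph_list())


tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese")
encoded = tokenizer(segment("早稲田大学で自然言語処理を研究する。"), return_tensors="pt")
print(encoded["input_ids"])
```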
Model Capabilities
Japanese Text Understanding
Masked Language Modeling
Downstream Task Fine-tuning
Use Cases
Natural Language Processing
Japanese Text Infilling
Use masked language modeling to fill in missing parts of Japanese text.
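As a usage sketch of this infilling capability (assuming the same Hub repository id as above, and input that has already been segmented with Juman++), a fill-mask pipeline can rank candidate words for the masked position:

```python
# Hedged fill-mask sketch; the Hub id "nlp-waseda/roberta-large-japanese" is
# an assumption, and the input is already whitespace-segmented with Juman++.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlp-waseda/roberta-large-japanese")
mask = fill_mask.tokenizer.mask_token  # avoid hard-coding the mask string
segmented = f"早稲田 大学 で 自然 言語 処理 を {mask} する 。"
for candidate in fill_mask(segmented):
    print(candidate["token_str"], round(candidate["score"], 3))
```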
Downstream Task Fine-tuning
Fine-tune the model on specific Japanese NLP tasks such as text classification and named entity recognition.
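A compact fine-tuning sketch for binary text classification is shown below; the Hub id, the two-sentence toy dataset, and the hyperparameters are illustrative assumptions, not values from this card.

```python
# Hedged fine-tuning sketch: toy binary classification on pre-segmented text.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "nlp-waseda/roberta-large-japanese"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny illustrative dataset: Juman++-segmented sentences with binary labels.
texts = ["この 映画 は 面白い 。", "この 映画 は つまらない 。"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few passes over the toy batch
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")
```

A real run would add a held-out validation split and more data, but the loop above shows the essential pieces of adapting the pretrained encoder to a downstream task.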