🚀 nlp-waseda/roberta-base-japanese-with-auto-jumanpp
This is a Japanese RoBERTa base model pretrained on Japanese Wikipedia and the Japanese portion of CC-100. It can be used for masked language modeling and fine-tuned on downstream tasks.
🚀 Quick Start
You can use this model for masked language modeling as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer (automatic Juman++ word segmentation) and the pretrained model.
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
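The snippet stops after encoding the sentence. A minimal sketch of one way to finish it, assuming you want the top candidate tokens for the [MASK] position (not part of the original card):

```python
import torch

with torch.no_grad():
    logits = model(**encoding).logits

# Locate the [MASK] position and decode the five most likely replacement tokens.
mask_index = (encoding['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_index[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```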
You can fine-tune this model on downstream tasks.
✨ Features
- Masked Language Modeling: Can be used for masked language modeling tasks.
- Downstream Task Fine-Tuning: Allows fine-tuning on various downstream tasks.
📦 Installation
This model runs with the Hugging Face transformers library and sentencepiece (for example, `pip install transformers sentencepiece`). The automatic Juman++ tokenization additionally requires a working Juman++ installation and a Python binding such as rhoknp or pyknp; see the Tokenization section below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
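As an alternative to calling the model directly, the fill-mask pipeline wraps the same steps; this is a sketch under the assumption that the pipeline's default mask handling works with this tokenizer:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
print(fill_mask('早稲田大学で自然言語処理を[MASK]する。'))  # top candidates with scores
```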
Advanced Usage
```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese-with-auto-jumanpp")
optimizer = AdamW(model.parameters(), lr=1e-4)

# num_epochs and train_dataloader are assumed to be defined elsewhere;
# each batch is expected to provide raw text and matching MLM labels.
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
        labels = batch['labels']
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
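The loop above assumes the dataloader already supplies masked-LM labels. In practice these labels are often generated on the fly with DataCollatorForLanguageModeling; the following is a minimal sketch of how train_dataloader could be built that way (the toy corpus, batch size, and masking probability are placeholder assumptions), after which the loop body reduces to model(**batch):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

texts = ['早稲田大学で自然言語処理を研究する。']  # placeholder corpus
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# The collator pads each batch, selects a random 15% of tokens for masking,
# and fills `labels` with the original ids (-100 elsewhere).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
train_dataloader = DataLoader(encodings, batch_size=8, collate_fn=collator)

for batch in train_dataloader:
    outputs = model(**batch)  # batch carries input_ids, attention_mask, and labels
    print(outputs.loss)
```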
📚 Documentation
Tokenization
`BertJapaneseTokenizer` now supports automatic tokenization with Juman++. However, if your dataset is large, tokenization may take a long time, since `BertJapaneseTokenizer` does not yet support fast tokenization. You can also run the Juman++ tokenization yourself and use the older model nlp-waseda/roberta-base-japanese.
Juman++ 2.0.0-rc3 was used for pretraining. Each word is further split into subword tokens by sentencepiece.
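For reference, here is a rough sketch of doing the Juman++ segmentation yourself with the pyknp binding and feeding whitespace-separated words to the older nlp-waseda/roberta-base-japanese model; the binding choice and loading code are assumptions, not taken from the original card:

```python
# Requires a working Juman++ installation and the pyknp package.
from pyknp import Juman
from transformers import AutoTokenizer, AutoModelForMaskedLM

jumanpp = Juman()  # wraps the jumanpp command
old_tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
old_model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-base-japanese")

text = '早稲田大学で自然言語処理を研究する。'
words = [mrph.midasi for mrph in jumanpp.analysis(text).mrph_list()]  # surface forms
encoding = old_tokenizer(' '.join(words), return_tensors='pt')
```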
Vocabulary
The vocabulary consists of 32000 tokens including words (JumanDIC) and subwords induced by the unigram language model of sentencepiece.
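To see this concretely, you can inspect the tokenizer loaded in the usage examples above (a small sketch; the example sentence is arbitrary):

```python
print(tokenizer.vocab_size)  # expected to match the 32000 tokens described above
print(tokenizer.tokenize('早稲田大学で自然言語処理を研究する。'))  # Juman++ words split into sentencepiece subwords
```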
Training procedure
This model was trained on Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100. It took a week using eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining:
- learning_rate: 1e-4
- per_device_train_batch_size: 256
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 4096
- max_seq_length: 128
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- training_steps: 700000
- warmup_steps: 10000
- mixed_precision_training: Native AMP
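These settings are mutually consistent: 256 examples per device × 8 devices × 2 gradient-accumulation steps gives the total train batch size of 4096.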
Performance on JGLUE
See the Baseline Scores of JGLUE.
🔧 Technical Details
This model is based on the RoBERTa architecture, a robustly optimized BERT-like model. It was pretrained on large-scale Japanese corpora (Japanese Wikipedia and the Japanese portion of CC-100) with the masked language modeling objective. Juman++ word segmentation helps handle Japanese text, which has no whitespace between words, and sentencepiece is used for further subword tokenization. The hyperparameters were tuned for masked language modeling pretraining and can be further adjusted for downstream tasks.
📄 License
This model is licensed under the CC BY-SA 4.0 license.
| Property | Details |
|----------|---------|
| Model Type | Japanese RoBERTa base model |
| Training Data | Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100 |