🚀 nlp-waseda/roberta-large-japanese-with-auto-jumanpp
This project provides a large Japanese RoBERTa model pretrained on Japanese Wikipedia and the Japanese portion of CC-100, offering strong natural language processing capabilities for Japanese text.
🚀 Quick Start
You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
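Equivalently, the fill-mask pipeline can run the same example end to end. This is a minimal sketch assuming the standard transformers pipeline API; it is not part of the original card:

```python
from transformers import pipeline

# Sketch: fill-mask pipeline with the same checkpoint (illustrative, not from the card).
fill_mask = pipeline(
    "fill-mask",
    model="nlp-waseda/roberta-large-japanese-with-auto-jumanpp",
)
for candidate in fill_mask('早稲田大学で自然言語処理を[MASK]する。'):
    print(candidate["token_str"], candidate["score"])
```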
You can also fine-tune this model on downstream tasks (see the sketch under Advanced Usage below).
✨ Features
- Pretrained on large-scale data: Trained on Japanese Wikipedia and the Japanese portion of CC-100, providing rich language knowledge.
- Support for Juman++ tokenization: `BertJapaneseTokenizer` supports automatic Juman++ tokenization (see the snippet after this list).
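As a quick, illustrative check of the automatic word segmentation (this snippet is a sketch and not part of the original card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
# The tokenizer first segments the sentence into words with Juman++,
# then splits each word into subwords with the sentencepiece model.
print(tokenizer.tokenize('早稲田大学で自然言語処理を研究する。'))
```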
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
model = AutoModelForMaskedLM.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")

sentence = '早稲田大学で自然言語処理を[MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
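The trailing `...` is left open in the original card. One way to continue the snippet above is to decode the model's top prediction for the [MASK] position; the following is only a sketch under that assumption, not the authors' reference code:

```python
import torch

# Run the model and read off the highest-scoring token at the [MASK] position.
with torch.no_grad():
    logits = model(**encoding).logits
mask_index = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```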
Advanced Usage
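The card does not include an advanced example, so the following is only a sketch of how fine-tuning for sequence classification might look with the Trainer API. The toy texts, labels, output directory, and hyperparameters are placeholders:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "nlp-waseda/roberta-large-japanese-with-auto-jumanpp"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

class ToyDataset(Dataset):
    """Tiny in-memory dataset; replace with a real downstream task."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder sentences: "It was a very good movie." / "It was a boring movie."
train_dataset = ToyDataset(['とても良い映画だった。', '退屈な映画だった。'], [1, 0])

args = TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```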
📚 Documentation
Tokenization
`BertJapaneseTokenizer` now supports automatic tokenization for Juman++. However, if your dataset is large, tokenization may take a long time, since `BertJapaneseTokenizer` does not yet support fast tokenization. You can still run the Juman++ tokenization yourself and use the old model nlp-waseda/roberta-large-japanese.
Juman++ 2.0.0-rc3 was used for pretraining. Each word is then split into subword tokens by sentencepiece.
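For manual pre-tokenization (for example, to use the old model mentioned above), a sketch using the pyknp wrapper around Juman++ might look like the following; pyknp and a local Juman++ installation are assumptions on my part and are not part of this card:

```python
from pyknp import Juman  # assumes pyknp and a local Juman++ installation
from transformers import AutoTokenizer

jumanpp = Juman()  # pyknp drives the jumanpp binary
sentence = '早稲田大学で自然言語処理を研究する。'
words = [mrph.midasi for mrph in jumanpp.analysis(sentence).mrph_list()]
segmented = ' '.join(words)  # whitespace-separated words for the old model

old_tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese")
encoding = old_tokenizer(segmented, return_tensors='pt')
```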
Vocabulary
The vocabulary consists of 32,000 tokens, including words (from JumanDIC) and subwords induced by the unigram language model of sentencepiece.
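As an illustrative check (not from the original card), the reported vocabulary size can be read directly from the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")
print(tokenizer.vocab_size)  # expected to report the 32,000-token vocabulary
```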
Training procedure
This model was trained on Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100. Training took two weeks using eight NVIDIA A100 GPUs.
The following hyperparameters were used during pretraining (an illustrative TrainingArguments sketch follows the list):
- learning_rate: 6e-5
- per_device_train_batch_size: 103
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 5
- total_train_batch_size: 4120
- max_seq_length: 128
- optimizer: Adam with betas=(0.9, 0.98) and epsilon=1e-6
- lr_scheduler_type: linear
- training_steps: 670000
- warmup_steps: 10000
- mixed_precision_training: Native AMP
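For illustration, these settings could be expressed roughly as Hugging Face TrainingArguments. The original pretraining script is not part of this card, so treat the mapping below as a sketch (note that 103 per device × 8 GPUs × 5 accumulation steps = 4120, the reported total batch size):

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the authors' actual script.
args = TrainingArguments(
    output_dir="roberta-large-japanese-pretraining",  # placeholder path
    learning_rate=6e-5,
    per_device_train_batch_size=103,  # 103 x 8 GPUs x 5 accumulation steps = 4120
    gradient_accumulation_steps=5,
    max_steps=670_000,
    warmup_steps=10_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    fp16=True,  # Native AMP mixed precision
)
# max_seq_length=128 is applied when tokenizing the corpus, not via TrainingArguments.
```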
Performance on JGLUE
See the Baseline Scores of JGLUE.
🔧 Technical Details
This model is a large Japanese RoBERTa model. It uses Juman++ for word-level tokenization and sentencepiece for subword tokenization. The vocabulary size is 32,000, covering words from JumanDIC and subwords induced by the unigram language model of sentencepiece. Pretraining took two weeks on eight NVIDIA A100 GPUs, using Japanese Wikipedia and the Japanese portion of CC-100.
📄 License
This model is licensed under the CC BY-SA 4.0 license.
| Property | Details |
|----------|---------|
| Model Type | Japanese RoBERTa large model |
| Training Data | Japanese Wikipedia (as of 20210920) and the Japanese portion of CC-100 |