Deberta-v2-base-japanese Open-Source Japanese Model - Supports Language Modeling and Fine-Tuning for Downstream Tasks

Deberta V2 Base Japanese

Developed by ku-nlp

A Japanese DeBERTa V2 base model pretrained on Japanese Wikipedia, CC-100, and OSCAR corpora, suitable for masked language modeling and downstream task fine-tuning.

Large Language Model

Transformers

Japanese #Japanese Masked Language Modeling #Juman++ Tokenization #Wikipedia Pretraining

Downloads 38.93k

Release Time: 1/5/2023

Model Overview

This is a DeBERTa V2 model pretrained on large-scale Japanese corpora. It is designed primarily for Japanese masked language modeling and can also be fine-tuned for a range of natural language understanding tasks.

Model Features

High-Quality Japanese Pretraining

Pretrained on Japanese Wikipedia and the Japanese portions of CC-100 and OSCAR, about 171GB of text in total.

Juman++ Word Segmentation

Input text must be segmented into words with Juman++ before it is passed to the tokenizer, which then splits each word into subwords with sentencepiece.

Multi-Task Adaptability

In addition to masked language modeling, it can be fine-tuned for various natural language understanding tasks such as text classification and question answering.

Model Capabilities

Japanese Text Understanding

Masked Language Modeling

Natural Language Processing Task Fine-tuning

Use Cases

Natural Language Understanding

Text Classification

Can be used for Japanese text classification tasks such as sentiment analysis and topic classification.

Achieved 0.970 accuracy on MARC-ja (JGLUE dev set)

Semantic Similarity Calculation

Can be used to calculate semantic similarity between Japanese text pairs.

Achieved a Pearson correlation coefficient of 0.922 on JSTS (JGLUE dev set)

Question Answering System

Can be used to build Japanese question answering systems.

Achieved an F1 score of 0.951 on JSQuAD (JGLUE dev set)

🚀 Japanese DeBERTa V2 base

A pre-trained Japanese language model based on the DeBERTa V2 architecture, trained on multiple Japanese corpora for masked language modeling and downstream NLU tasks.

🚀 Quick Start

You can use this model for masked language modeling as follows:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-base-japanese')

sentence = '京都 大学 で 自然 言語 処理 を [MASK] する 。'  # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
...

You can also fine-tune this model on downstream tasks.

✨ Features

Pre-trained on Multiple Corpora: Trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.
High Accuracy: Achieved an accuracy of 0.779 on the masked language modeling task, measured on 5,000 documents randomly sampled from each training corpus.
Suitable for NLU Tasks: Can be fine-tuned on various Japanese NLU tasks.

📦 Installation

The original README provides no installation steps. Using the model requires the transformers library, which can be installed with pip install transformers.

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-base-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-base-japanese')

sentence = '京都 大学 で 自然 言語 処理 を [MASK] する 。'  # input should be segmented into words by Juman++ in advance
encoding = tokenizer(sentence, return_tensors='pt')
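
To turn the encoding above into actual predictions, the model's logits at the [MASK] position can be ranked. The following is a minimal sketch rather than code from the original README: the helper name `predict_mask` and its `top_k` parameter are illustrative, and it assumes `torch` is installed alongside `transformers`.

```python
MODEL_ID = "ku-nlp/deberta-v2-base-japanese"


def predict_mask(sentence, top_k=5):
    """Return the top_k candidate tokens for each [MASK] in a pre-segmented sentence.

    Hypothetical helper for illustration; not part of the original README.
    """
    # Imported lazily so that defining the helper does not need the heavy dependencies.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
    model.eval()

    encoding = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, sequence_length, vocab_size)

    # Positions whose token id is the tokenizer's mask token.
    is_mask = encoding["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0].tolist()

    predictions = []
    for position in mask_positions:
        top_ids = logits[0, position].topk(top_k).indices.tolist()
        predictions.append(tokenizer.convert_ids_to_tokens(top_ids))
    return predictions
```

Calling `predict_mask('京都 大学 で 自然 言語 処理 を [MASK] する 。')` downloads the model on first use and returns one list of candidate tokens per [MASK] in the input.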

📚 Documentation

Model description

This is a Japanese DeBERTa V2 base model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

Tokenization

The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by sentencepiece.
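
The segmentation step can be scripted. The sketch below is one possible approach, not the authors' pipeline: it assumes a `jumanpp` binary is available on PATH and emits the standard Juman-style output (one morpheme per line with the surface form as the first field, `@ ` lines for alternative analyses, and `EOS` at the end of a sentence). The function names `surfaces_from_juman_output` and `segment` are hypothetical, and edge cases such as literal spaces or a literal `@` in the input are not handled.

```python
import subprocess


def surfaces_from_juman_output(output):
    """Extract surface forms from Juman-style analysis output for one sentence."""
    surfaces = []
    for line in output.splitlines():
        if line == "EOS":  # end-of-sentence marker
            break
        if not line or line.startswith("@ "):  # "@ " lines are alternative analyses
            continue
        surfaces.append(line.split(" ", 1)[0])  # first field is the surface form
    return surfaces


def segment(text):
    """Segment one sentence with a `jumanpp` binary assumed to be on PATH."""
    result = subprocess.run(
        ["jumanpp"], input=text + "\n", capture_output=True, text=True, check=True
    )
    return " ".join(surfaces_from_juman_output(result.stdout))
```

The string returned by `segment` has words separated by single spaces, which is the input shape the usage example expects before the tokenizer is applied.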

Training data

We used the following corpora for pre-training:

Japanese Wikipedia (as of 20221020, 3.2GB, 27M sentences, 1.3M documents)
Japanese portion of CC-100 (85GB, 619M sentences, 66M documents)
Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR. Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB (10 × 3.2GB + 85GB + 54GB).
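
The 171GB figure follows directly from the corpus sizes listed above. A short check, using only the numbers stated in this section (the variable names are illustrative):

```python
# Sizes in GB, as listed in the training data section.
corpus_gb = {"wikipedia": 3.2, "cc100": 85.0, "oscar": 54.0}
WIKIPEDIA_COPIES = 10  # Japanese Wikipedia was duplicated 10 times

effective_gb = dict(corpus_gb)
effective_gb["wikipedia"] = corpus_gb["wikipedia"] * WIKIPEDIA_COPIES

total_gb = sum(effective_gb.values())
raw_total_gb = sum(corpus_gb.values())

print(f"total: {total_gb:.0f} GB")  # total: 171 GB
print(f"Wikipedia share without duplication: {corpus_gb['wikipedia'] / raw_total_gb:.1%}")
print(f"Wikipedia share with duplication: {effective_gb['wikipedia'] / total_gb:.1%}")
```

Duplication raises Wikipedia's share of the training text from roughly 2% to roughly 19%.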

Training procedure

We first segmented texts in the corpora into words using [Juman++](https://github.com/ku-nlp/jumanpp). Then, we built a sentencepiece model with 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of sentencepiece.

We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese DeBERTa model using the transformers library. The training took three weeks using 8 NVIDIA A100-SXM4-40GB GPUs.

The following hyperparameters were used during pre-training:

learning_rate: 2e-4
per_device_train_batch_size: 44
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 6
total_train_batch_size: 2,112
max_seq_length: 512
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
lr_scheduler_type: linear schedule with warmup
training_steps: 500,000
warmup_steps: 10,000
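
Two of these values can be derived from the others. The sketch below recomputes the total batch size and illustrates the learning-rate curve. The schedule shape is an assumption: the list only says "linear schedule with warmup", and the code assumes linear warmup from zero followed by linear decay to zero at the final step, which is how `get_linear_schedule_with_warmup` in transformers behaves. The function name `linear_schedule_lr` is illustrative.

```python
LEARNING_RATE = 2e-4
PER_DEVICE_BATCH = 44
NUM_DEVICES = 8
GRAD_ACCUM_STEPS = 6
TRAINING_STEPS = 500_000
WARMUP_STEPS = 10_000

# Sequences contributing to each optimizer update.
total_train_batch_size = PER_DEVICE_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS


def linear_schedule_lr(step):
    """Learning rate at `step`: linear warmup, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print(total_train_batch_size)  # 2112
for step in (0, 5_000, 10_000, 255_000, 500_000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the rate peaks at 2e-4 when warmup ends at step 10,000, which is 2% of the training run.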

The accuracy of the trained model on the masked language modeling task was 0.779. The evaluation set consists of 5,000 randomly sampled documents from each of the training corpora.
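
The text does not spell out how this accuracy is computed. A common definition, assumed here, is the fraction of masked positions at which the model's top prediction matches the original token, with unmasked positions ignored. The sketch below illustrates that definition on toy data; the name `masked_lm_accuracy` and the `-100` ignore label (a widespread convention in transformers data collators) are assumptions, not details from the original evaluation.

```python
IGNORE_INDEX = -100  # conventional label for positions that were not masked


def masked_lm_accuracy(predicted_ids, label_ids):
    """Fraction of masked positions where the predicted token id equals the label."""
    correct = 0
    total = 0
    for predicted, label in zip(predicted_ids, label_ids):
        if label == IGNORE_INDEX:
            continue  # position was not masked, so it does not count
        total += 1
        correct += int(predicted == label)
    return correct / total if total else 0.0


# Toy example: three masked positions, two of them predicted correctly.
print(masked_lm_accuracy([5, 9, 7, 3, 2], [-100, 9, 7, -100, 4]))
```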

Fine-tuning on NLU tasks

We fine-tuned the following models and evaluated them on the dev set of JGLUE. We tuned the learning rate and the number of training epochs for each model and task, following the JGLUE paper.

Model	MARC-ja/acc	JSTS/pearson	JSTS/spearman	JNLI/acc	JSQuAD/EM	JSQuAD/F1	JComQA/acc
Waseda RoBERTa base	0.965	0.913	0.876	0.905	0.853	0.916	0.853
Waseda RoBERTa large (seq512)	0.969	0.925	0.890	0.928	0.910	0.955	0.900
LUKE Japanese base*	0.965	0.916	0.877	0.912	-	-	0.842
LUKE Japanese large*	0.965	0.932	0.902	0.927	-	-	0.893
DeBERTaV2 base	0.970	0.922	0.886	0.922	0.899	0.951	0.873
DeBERTaV2 large	0.968	0.925	0.892	0.924	0.912	0.959	0.890

*The scores of LUKE are from [the official repository](https://github.com/studio-ousia/luke).
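
The fine-tuning code behind these scores is not part of this page. As a rough starting point for a classification task such as MARC-ja, the sketch below loads the pre-trained encoder with a classification head and computes the loss for one batch. The helper name `classification_loss` and its parameters are hypothetical, `torch` is assumed to be installed, and the input texts are assumed to be already segmented by Juman++.

```python
MODEL_ID = "ku-nlp/deberta-v2-base-japanese"


def classification_loss(segmented_texts, labels, num_labels=2):
    """Run one forward pass of a sequence classifier and return the loss tensor.

    Hypothetical helper for illustration; not the authors' fine-tuning script.
    """
    # Imported lazily so that defining the helper does not need the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # The classification head is newly initialized; only the encoder is pre-trained.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )

    batch = tokenizer(
        segmented_texts,
        padding=True,
        truncation=True,
        max_length=512,  # matches max_seq_length used in pre-training
        return_tensors="pt",
    )
    outputs = model(**batch, labels=torch.tensor(labels))
    return outputs.loss
```

In an actual fine-tuning run this loss would be backpropagated inside a training loop or handled by the transformers Trainer, with the learning rate and number of epochs tuned per task.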

🔧 Technical Details

Model Architecture: Based on the DeBERTa V2 architecture.
Tokenization Process: Involves word segmentation by Juman++ and subword tokenization by sentencepiece.
Training Hardware: 8 NVIDIA A100-SXM4-40GB GPUs were used for training.

📄 License

The model is licensed under CC-BY-SA-4.0.
