# DeBERTa V3 Japanese Large Model
This is a DeBERTa V3 model pre-trained on Japanese resources. It offers features tailored to Japanese language processing: it eliminates the need for a morphological analyzer during inference and, to some extent, respects word boundaries.
## Quick Start
To use this model, you can follow the code example below:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-large'

# Load the tokenizer and a model with a token-classification head.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
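As a quick check, the snippet below tokenizes raw Japanese text and runs a forward pass. This is only an illustrative sketch: the sample sentence is arbitrary, and the token-classification head loaded above is randomly initialized, so its logits are meaningful only after fine-tuning.

```python
import torch

# Raw text goes straight into the tokenizer; no prior word segmentation is required.
text = '今日は天気が良いので散歩に出かけた。'  # illustrative sample sentence
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Per-token logits from the (still untrained) classification head.
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```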
## Features
- Based on Well-known Architecture: Utilizes the established DeBERTa V3 model.
- Japanese-Specific: Specialized for the Japanese language.
- No Morphological Analyzer: Does not require a morphological analyzer during inference.
- Respects Word Boundaries: To some extent, it respects word boundaries and does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを含む`.
## Documentation
### Tokenizer
The tokenizer is trained using the method introduced by Kudo. Key points include:
- No Morphological Analyzer Needed: Eliminates the need for a morphological analyzer during inference.
- Respects Word Boundaries: Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`).
- Hugging Face Compatibility: Easy to use with Hugging Face.
- Smaller Vocabulary Size: Addresses the issue of excessive embedding layer parameters in the original DeBERTa V3 by adopting a smaller vocabulary size.
Note that among the three models (`xsmall`, `base`, and `large`), the first two were trained with the unigram algorithm, while only the `large` model was trained with the BPE algorithm. This was because training with the unigram algorithm was unsuccessful when the vocabulary size was increased for the `large` model.
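To see the tokenizer's behavior directly, the short sketch below tokenizes raw text and prints the resulting subword tokens. The sample sentence is arbitrary, and the exact segmentation depends on the learned vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('globis-university/deberta-v3-japanese-large')

# Tokenize raw text directly; no morphological analyzer (e.g. MeCab) is involved,
# and the resulting tokens should not cross word boundaries.
print(tokenizer.tokenize('日本語の自然言語処理を試しています。'))  # illustrative sentence

# The vocabulary size should match the value listed under the training parameters.
print(tokenizer.vocab_size)
```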
### Data
| Dataset | File Size (with metadata) | Factor |
|---|---|---|
| Wikipedia (2023/07; WikiExtractor) | 3.5GB | x2 |
| Wikipedia (2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py)) | 4.8GB | x2 |
| WikiBooks (2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py)) | 43MB | x2 |
| Aozora Bunko (2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/globis-university/aozorabunko-clean)) | 496MB | x4 |
| CC-100 (ja) | 90GB | x1 |
| mC4 (ja; extracted 10%, with Wikipedia-like focus via DSIR) | 91GB | x1 |
| OSCAR 2023 (ja; extracted 10%, with Wikipedia-like focus via DSIR) | 26GB | x1 |
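The table does not spell out how the Factor column is applied; a plausible reading is that each corpus is repeated that many times when the corpora are mixed for pre-training. The sketch below illustrates that interpretation with the `datasets` library; the corpus names and contents are placeholders, not the actual preprocessing pipeline.

```python
from datasets import Dataset, concatenate_datasets

# Placeholder corpora standing in for the real preprocessed datasets.
corpora = {
    'wikipedia': Dataset.from_dict({'text': ['...']}),
    'aozorabunko': Dataset.from_dict({'text': ['...']}),
    'cc100': Dataset.from_dict({'text': ['...']}),
}
# Oversampling factors as listed in the table above (subset shown).
factors = {'wikipedia': 2, 'aozorabunko': 4, 'cc100': 1}

# Repeat each corpus according to its factor, then concatenate and shuffle.
mixed = concatenate_datasets(
    [corpora[name] for name, f in factors.items() for _ in range(f)]
).shuffle(seed=42)
```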
### Training parameters
- Number of devices: 8
- Batch size: 8 x 8
- Learning rate: 6.4e-5
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 2,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
- Vocabulary size: 48,000
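For orientation, the sketch below shows how these hyperparameters would map onto `transformers.TrainingArguments`. It is only an illustration: the actual DeBERTa V3 pre-training uses ELECTRA-style replaced token detection with a generator/discriminator pair, which a plain `Trainer` configuration like this does not capture, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir='./deberta-v3-japanese-large-pretrain',  # placeholder path
    per_device_train_batch_size=8,   # batch size 8 on each of 8 devices
    learning_rate=6.4e-5,
    max_steps=2_000_000,             # training steps
    warmup_steps=100_000,
    lr_scheduler_type='linear',      # linear schedule with warmup
    optim='adamw_torch',             # AdamW optimizer
    fp16=True,                       # mixed precision
)
```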
### Evaluation

JSTS scores are Pearson/Spearman correlation, JSQuAD scores are exact match/F1, and JNLI and JCQA (JCommonsenseQA) scores are accuracy.

| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
|---|---|---|---|---|---|
| **≤ small** | | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 33.7M | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
| **base** | | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 112M | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 160M | 0.927/0.891 | 0.927 | 0.896/- | - |
| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 110M | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
| **large** | | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 337M | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 337M | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 337M | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 352M | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |
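To reproduce numbers like these, the model is fine-tuned separately on each JGLUE task. The sketch below shows only the starting point for a classification task such as JNLI: it loads the encoder with a freshly initialized 3-way classification head and runs one forward pass on a hypothetical premise/hypothesis pair; the actual fine-tuning loop, datasets, and hyperparameters are not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'globis-university/deberta-v3-japanese-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 3 labels for an NLI-style task (entailment / contradiction / neutral); the head is untrained.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical sentence pair; JGLUE provides the real premise/hypothesis pairs.
premise = '猫がソファの上で寝ている。'
hypothesis = '動物がソファの上にいる。'
inputs = tokenizer(premise, hypothesis, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 3); meaningful only after fine-tuning on JNLI
```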
## License
CC BY-SA 4.0
## Acknowledgement
We used ABCI for computing resources. Thank you.