# DeBERTa V3 Japanese Large Model
This is a DeBERTa V3 model pre-trained on Japanese resources. It offers features tailored to Japanese language processing: it eliminates the need for a morphological analyzer during inference and, to some extent, respects word boundaries.
## Quick Start
To use this model, you can follow the code example below:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-large'

# Load the tokenizer and a model with a token-classification head.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
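As a quick check, the snippet below tokenizes raw Japanese text and runs a forward pass. This is only an illustrative sketch: the sample sentence is arbitrary, and the token-classification head loaded above is randomly initialized, so its logits are meaningful only after fine-tuning.

```python
import torch

# Raw text goes straight into the tokenizer; no prior word segmentation is required.
text = '今日は天気が良いので散歩に出かけた。'  # illustrative sample sentence
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Per-token logits from the (still untrained) classification head.
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```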
## Features
- Based on Well-known Architecture: Utilizes the established DeBERTa V3 model.
- Japanese-Specific: Specialized for the Japanese language.
- No Morphological Analyzer: Does not require a morphological analyzer during inference.
- Respects Word Boundaries: To some extent, it respects word boundaries and does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを含む`.
## Documentation
### Tokenizer
The tokenizer is trained using the method introduced by Kudo. Key points include:
- No Morphological Analyzer Needed: Eliminates the need for a morphological analyzer during inference.
- Respects Word Boundaries: Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`).
- Hugging Face Compatibility: Easy to use with Hugging Face.
- Smaller Vocabulary Size: Addresses the issue of excessive embedding layer parameters in the original DeBERTa V3 by adopting a smaller vocabulary size.
Note that among the three models (`xsmall`, `base`, and `large`), the first two were trained with the unigram algorithm, while only the `large` model was trained with the BPE algorithm. This was because training with the unigram algorithm was unsuccessful when the vocabulary size was increased for the `large` model.
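To see the tokenizer's behavior directly, the short sketch below tokenizes raw text and prints the resulting subword tokens. The sample sentence is arbitrary, and the exact segmentation depends on the learned vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('globis-university/deberta-v3-japanese-large')

# Tokenize raw text directly; no morphological analyzer (e.g. MeCab) is involved,
# and the resulting tokens should not cross word boundaries.
print(tokenizer.tokenize('日本語の自然言語処理を試しています。'))  # illustrative sentence

# The vocabulary size should match the value listed under the training parameters.
print(tokenizer.vocab_size)
```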
### Data
| Dataset | File Size (with metadata) | Factor |
|---|---|---|
| Wikipedia (2023/07; WikiExtractor) | 3.5GB | x2 |
| Wikipedia (2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py)) | 4.8GB | x2 |
| WikiBooks (2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py)) | 43MB | x2 |
| Aozora Bunko (2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/globis-university/aozorabunko-clean)) | 496MB | x4 |
| CC-100 (ja) | 90GB | x1 |
| mC4 (ja; extracted 10%, with Wikipedia-like focus via DSIR) | 91GB | x1 |
| OSCAR 2023 (ja; extracted 10%, with Wikipedia-like focus via DSIR) | 26GB | x1 |
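The table does not spell out how the Factor column is applied; a plausible reading is that each corpus is repeated that many times when the corpora are mixed for pre-training. The sketch below illustrates that interpretation with the `datasets` library; the corpus names and contents are placeholders, not the actual preprocessing pipeline.

```python
from datasets import Dataset, concatenate_datasets

# Placeholder corpora standing in for the real preprocessed datasets.
corpora = {
    'wikipedia': Dataset.from_dict({'text': ['...']}),
    'aozorabunko': Dataset.from_dict({'text': ['...']}),
    'cc100': Dataset.from_dict({'text': ['...']}),
}
# Oversampling factors as listed in the table above (subset shown).
factors = {'wikipedia': 2, 'aozorabunko': 4, 'cc100': 1}

# Repeat each corpus according to its factor, then concatenate and shuffle.
mixed = concatenate_datasets(
    [corpora[name] for name, f in factors.items() for _ in range(f)]
).shuffle(seed=42)
```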
### Training parameters
- Number of devices: 8
- Batch size: 8 x 8
- Learning rate: 6.4e-5
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 2,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
- Vocabulary size: 48,000
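For orientation, the sketch below shows how these hyperparameters would map onto `transformers.TrainingArguments`. It is only an illustration: the actual DeBERTa V3 pre-training uses ELECTRA-style replaced token detection with a generator/discriminator pair, which a plain `Trainer` configuration like this does not capture, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir='./deberta-v3-japanese-large-pretrain',  # placeholder path
    per_device_train_batch_size=8,   # batch size 8 on each of 8 devices
    learning_rate=6.4e-5,
    max_steps=2_000_000,             # training steps
    warmup_steps=100_000,
    lr_scheduler_type='linear',      # linear schedule with warmup
    optim='adamw_torch',             # AdamW optimizer
    fp16=True,                       # mixed precision
)
```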
### Evaluation

JSTS scores are Pearson/Spearman correlation, JSQuAD scores are exact match/F1, and JNLI and JCQA (JCommonsenseQA) scores are accuracy.

| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
|---|---|---|---|---|---|
| **≤ small** | | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 33.7M | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
| **base** | | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 112M | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 160M | 0.927/0.891 | 0.927 | 0.896/- | - |
| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 110M | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
| **large** | | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 337M | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 337M | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 337M | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 352M | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |
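To reproduce numbers like these, the model is fine-tuned separately on each JGLUE task. The sketch below shows only the starting point for a classification task such as JNLI: it loads the encoder with a freshly initialized 3-way classification head and runs one forward pass on a hypothetical premise/hypothesis pair; the actual fine-tuning loop, datasets, and hyperparameters are not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'globis-university/deberta-v3-japanese-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 3 labels for an NLI-style task (entailment / contradiction / neutral); the head is untrained.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical sentence pair; JGLUE provides the real premise/hypothesis pairs.
premise = '猫がソファの上で寝ている。'
hypothesis = '動物がソファの上にいる。'
inputs = tokenizer(premise, hypothesis, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, 3); meaningful only after fine-tuning on JNLI
```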
## License
CC BY-SA 4.0
## Acknowledgement
We used ABCI for computing resources. Thank you.