🚀 DeBERTa V3 Japanese Model
This is a DeBERTa V3 model pre-trained on Japanese resources. It addresses the challenges of Japanese language processing and offers high-performance solutions for related tasks.
🚀 Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
✨ Features
- Based on the well-known DeBERTa V3 model.
- Specialized for the Japanese language.
- Does not require a morphological analyzer during inference.
- Respects word boundaries to some extent (does not produce tokens spanning multiple words, such as `の都合上` or `の判定負けを喫し`); see the tokenization sketch after this list.
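As an illustrative check (not part of the original card; the example sentence is our own), you can inspect how the tokenizer splits an arbitrary sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('globis-university/deberta-v3-japanese-xsmall')

# Print the subword tokens for an arbitrary Japanese sentence.
# With this tokenizer, subwords should not straddle word boundaries.
print(tokenizer.tokenize('深層学習を用いた自然言語処理の研究を行っています。'))
```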
📦 Installation
No model-specific installation steps are required; the model can be used directly with the Hugging Face `transformers` library.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
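The snippet above only loads the weights. A minimal forward pass could look like the sketch below; note that the token-classification head is freshly initialized, so its outputs are only meaningful after fine-tuning on a downstream task (the example sentence is our own):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Encode a sentence and run it through the model.
inputs = tokenizer('今日はいい天気ですね。', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, num_labels); the head is untrained here.
print(outputs.logits.shape)
```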
📚 Documentation
Tokenizer
The tokenizer is trained using the method introduced by Kudo.
Key points include:
- No morphological analyzer needed during inference.
- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`).
- Easy to use with Hugging Face.
- Smaller vocabulary size.
The original DeBERTa V3 is characterized by a large vocabulary size, which significantly inflates the number of parameters in the embedding layer (for the microsoft/deberta-v3-base model, the embedding layer accounts for 54% of all parameters). To address this, this model adopts a smaller vocabulary size.
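As a rough sanity check of that figure (our own calculation, not from the original card; the exact percentage depends on which parameters are counted), the embedding layer's share can be computed directly:

```python
from transformers import AutoModel

# Compare the fraction of parameters spent on token embeddings.
# microsoft/deberta-v3-base uses a ~128k vocabulary; this model uses 32k.
for name in ['microsoft/deberta-v3-base', 'globis-university/deberta-v3-japanese-base']:
    model = AutoModel.from_pretrained(name)
    total = sum(p.numel() for p in model.parameters())
    embedding = model.get_input_embeddings().weight.numel()
    print(f'{name}: embedding share = {embedding / total:.0%}')
```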
Note that, among the three models (xsmall, base, and large), the first two were trained with the unigram algorithm, while only the large model was trained with the BPE algorithm.
The reason is simple: the tokenizer for the large model was trained independently in order to increase its vocabulary size, but for some reason training with the unigram algorithm was not successful.
Thus, prioritizing the completion of the model over investigating the cause, we switched to the BPE algorithm.
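For reference, a SentencePiece tokenizer along these lines could be trained roughly as follows. This is only a sketch with placeholder paths and illustrative options, not the authors' actual training script:

```python
import sentencepiece as spm

# Placeholder corpus path and illustrative options only.
# xsmall and base used the unigram algorithm; large fell back to BPE.
spm.SentencePieceTrainer.train(
    input='corpus_ja.txt',       # hypothetical pre-processed Japanese corpus
    model_prefix='spm_japanese',
    vocab_size=32000,
    model_type='unigram',        # use 'bpe' to mirror the large model's tokenizer
    character_coverage=0.9995,
)
```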
Data
| Dataset Name | Notes | File Size (with metadata) | Factor |
|---|---|---|---|
| Wikipedia | 2023/07; WikiExtractor | 3.5GB | x2 |
| Wikipedia | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 4.8GB | x2 |
| WikiBooks | 2023/07; [cl-tohoku's method](https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py) | 43MB | x2 |
| Aozora Bunko | 2023/07; [globis-university/aozorabunko-clean](https://huggingface.co/datasets/globis-university/aozorabunko-clean) | 496MB | x4 |
| CC-100 | ja | 90GB | x1 |
| mC4 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 91GB | x1 |
| OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 26GB | x1 |
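If the Factor column is read as an up-sampling factor (i.e. how many times a corpus is repeated in the training mix), applying it might look like the sketch below; the file paths are placeholders, not the actual pipeline:

```python
from datasets import concatenate_datasets, load_dataset

# Placeholder corpora: each text file stands in for one processed corpus.
corpora = [
    ('wikipedia_ja.txt', 2),    # x2
    ('aozorabunko_ja.txt', 4),  # x4
    ('cc100_ja.txt', 1),        # x1
]

parts = []
for path, factor in corpora:
    dataset = load_dataset('text', data_files=path, split='train')
    parts.extend([dataset] * factor)  # repeat the corpus `factor` times

mixed = concatenate_datasets(parts).shuffle(seed=42)
```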
Training parameters
- Number of devices: 8
- Batch size: 48 x 8
- Learning rate: 3.84e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
- Vocabulary size: 32,000
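Mapped onto Hugging Face `TrainingArguments`, these settings would look roughly like the following; this is an illustrative reconstruction (the output directory is a placeholder), not the authors' actual training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='deberta-v3-japanese-pretraining',  # placeholder
    per_device_train_batch_size=48,  # x 8 devices = 384 sequences per step
    learning_rate=3.84e-4,
    max_steps=1_000_000,
    warmup_steps=100_000,
    lr_scheduler_type='linear',      # linear schedule with warmup
    optim='adamw_torch',
    fp16=True,                       # mixed precision
)
```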
Evaluation
JCQA refers to JCommonsenseQA. Cells with two values report two metrics (JSTS: Pearson/Spearman correlation; JSQuAD: exact match/F1); JNLI and JCQA report accuracy.

| Model | #params | JSTS | JNLI | JSQuAD | JCQA |
|---|---|---|---|---|---|
| ≤ small | | | | | |
| [izumi-lab/deberta-v2-small-japanese](https://huggingface.co/izumi-lab/deberta-v2-small-japanese) | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
| [globis-university/deberta-v3-japanese-xsmall](https://huggingface.co/globis-university/deberta-v3-japanese-xsmall) | 33.7M | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
| base | | | | | |
| [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
| [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
| [izumi-lab/deberta-v2-base-japanese](https://huggingface.co/izumi-lab/deberta-v2-base-japanese) | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
| [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese) | 112M | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
| [ku-nlp/deberta-v3-base-japanese](https://huggingface.co/ku-nlp/deberta-v3-base-japanese) | 160M | 0.927/0.891 | 0.927 | 0.896/- | - |
| [globis-university/deberta-v3-japanese-base](https://huggingface.co/globis-university/deberta-v3-japanese-base) | 110M | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
| large | | | | | |
| [cl-tohoku/bert-large-japanese-v2](https://huggingface.co/cl-tohoku/bert-large-japanese-v2) | 337M | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
| [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese) | 337M | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
| [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512) | 337M | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
| [ku-nlp/deberta-v2-large-japanese](https://huggingface.co/ku-nlp/deberta-v2-large-japanese) | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
| [globis-university/deberta-v3-japanese-large](https://huggingface.co/globis-university/deberta-v3-japanese-large) | 352M | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |
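For reference, fine-tuning one of these checkpoints on an NLI-style classification task (such as JNLI) follows the standard `transformers` recipe. The sketch below uses a tiny in-memory stand-in for the dataset; sentences, labels, and hyperparameters are purely illustrative:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tiny placeholder dataset with sentence pairs and labels.
raw = Dataset.from_dict({
    'sentence1': ['猫がソファで寝ている。', '彼は東京に住んでいる。'],
    'sentence2': ['動物が眠っている。', '彼は大阪に住んでいる。'],
    'label': [0, 2],
})
encoded = raw.map(lambda ex: tokenizer(ex['sentence1'], ex['sentence2'], truncation=True))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='jnli-finetune-demo', num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=encoded,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```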
📄 License
CC BY-SA 4.0
Acknowledgement
We used ABCI for computing resources. Thank you.