🚀 Japanese T5 Pretrained Model
This is a T5 (Text-to-Text Transfer Transformer) model pretrained on Japanese corpora. It addresses the need for a high-quality language model for Japanese NLP tasks, offering a foundation for a variety of text-based applications.
The model has been pretrained on the following Japanese corpora (approximately 100GB):
- Japanese dump data from Wikipedia (as of July 6, 2020)
- Japanese corpus from [OSCAR](https://oscar-corpus.com)
- Japanese corpus from [CC-100](http://data.statmt.org/cc-100/)
This model has only undergone pretraining; fine-tuning is required before it can be used for specific tasks. Also, like other language models trained on large-scale corpora, it may produce biased (unethical, harmful, or discriminatory) output reflecting biases in the training data. Please keep this possibility in mind and use the model only for purposes that will not cause harm.
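Since T5 casts every downstream task as text-to-text, fine-tuning data is typically expressed as input/target string pairs. The following is a minimal sketch of that formatting for a classification task; the task prefix and genre labels here are illustrative assumptions, not the exact ones used in the benchmark below:

```python
def to_text_pairs(articles, genres):
    """Cast a classification task into T5's text-to-text format:
    each example becomes an (input string, target string) pair."""
    pairs = []
    for article, genre in zip(articles, genres):
        # A task prefix tells the model which task the input belongs to.
        pairs.append((f"genre classification: {article}", genre))
    return pairs

# Hypothetical examples in the style of the Livedoor news corpus.
examples = to_text_pairs(
    ["新しいスマートフォンが発売された。", "映画のレビュー記事。"],
    ["it-life-hack", "movie-enter"],
)
```

During fine-tuning, the model is then trained to generate the target string given the input string, exactly as in pretraining.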
The SentencePiece tokenizer was trained on the entire Wikipedia data mentioned above.
🚀 Quick Start
Transfer Learning Sample Code
You can find the sample code for transfer learning at [https://github.com/sonoisa/t5-japanese](https://github.com/sonoisa/t5-japanese).
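The repository above walks through fine-tuning in detail; for a quick look at the pretrained checkpoint itself, it can be loaded with the Hugging Face Transformers library roughly as follows (a sketch, not taken from the repository; the prompt and generation settings are illustrative):

```python
MODEL_NAME = "sonoisa/t5-base-japanese"  # Hub ID from this model card


def load_model(name: str = MODEL_NAME):
    """Download the tokenizer and pretrained weights from the Hugging Face Hub.

    transformers is imported lazily so this module can be inspected
    without the library installed.
    """
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(name)
    model = T5ForConditionalGeneration.from_pretrained(name)
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_model()
    # This checkpoint is pretraining-only: without fine-tuning, generate()
    # reflects the span-corruption objective rather than any downstream task.
    batch = tokenizer("こんにちは、世界。", return_tensors="pt")
    output_ids = model.generate(**batch, max_length=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note that, as stated above, the raw checkpoint has only been pretrained, so meaningful task outputs require fine-tuning first.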
✨ Features
Benchmark
Livedoor News Classification Task
The accuracy of the news article genre prediction task using the Livedoor news corpus is as follows. Compared to Google's multilingual T5 model, this model has a 25% smaller model size and about 6 percentage points higher accuracy.
Japanese T5 ([t5-base-japanese](https://huggingface.co/sonoisa/t5-base-japanese), with 222M parameters, [reproduction code](https://github.com/sonoisa/t5-japanese/blob/main/t5_japanese_classification.ipynb))
| label | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.96 | 0.94 | 0.95 | 130 |
| 1 | 0.98 | 0.99 | 0.99 | 121 |
| 2 | 0.96 | 0.96 | 0.96 | 123 |
| 3 | 0.86 | 0.91 | 0.89 | 82 |
| 4 | 0.96 | 0.97 | 0.97 | 129 |
| 5 | 0.96 | 0.96 | 0.96 | 141 |
| 6 | 0.98 | 0.98 | 0.98 | 127 |
| 7 | 1.00 | 0.99 | 1.00 | 127 |
| 8 | 0.99 | 0.97 | 0.98 | 120 |
| accuracy | | | 0.97 | 1100 |
| macro avg | 0.96 | 0.96 | 0.96 | 1100 |
| weighted avg | 0.97 | 0.97 | 0.97 | 1100 |
Comparison target: Multilingual T5 ([google/mt5-small](https://huggingface.co/google/mt5-small), with 300M parameters)
| label | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.91 | 0.88 | 0.90 | 130 |
| 1 | 0.84 | 0.93 | 0.89 | 121 |
| 2 | 0.93 | 0.80 | 0.86 | 123 |
| 3 | 0.82 | 0.74 | 0.78 | 82 |
| 4 | 0.90 | 0.95 | 0.92 | 129 |
| 5 | 0.89 | 0.89 | 0.89 | 141 |
| 6 | 0.97 | 0.98 | 0.97 | 127 |
| 7 | 0.95 | 0.98 | 0.97 | 127 |
| 8 | 0.93 | 0.95 | 0.94 | 120 |
| accuracy | | | 0.91 | 1100 |
| macro avg | 0.91 | 0.90 | 0.90 | 1100 |
| weighted avg | 0.91 | 0.91 | 0.91 | 1100 |
JGLUE Benchmark
The results of the JGLUE benchmark are as follows (to be added sequentially):
- MARC-ja: In preparation
- JSTS: In preparation
- JNLI: In preparation
- JSQuAD: EM = 0.900, F1 = 0.945, [reproduction code](https://github.com/sonoisa/t5-japanese/blob/main/t5_JSQuAD.ipynb)
- JCommonsenseQA: In preparation
📄 License
[CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.ja)
Please also note that you need to comply with the [Terms of Use of Common Crawl](http://commoncrawl.org/terms-of-use/).
⚠️ Important Note
The author has taken great care in creating this model but makes no guarantees about the accuracy or safety of its output and assumes no responsibility for it. Should any inconvenience or damage arise from use of this model, neither the author of the model or datasets nor the author's affiliated organization shall bear any responsibility. When redistributing or deploying the model, users are obliged to make clear that the author and the affiliated organization will not be held responsible.