🚀 Japanese BART base
A Japanese BART base model pre-trained on Japanese Wikipedia, useful for various natural language processing tasks.
🚀 Quick Start
You can use this model as follows:
```python
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-base-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-base-japanese')

# The input must be segmented into words by Juman++ in advance (see Tokenization below).
sentence = '京都 大学 で 自然 言語 処理 を 専攻 する 。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
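The snippet above stops after encoding. As a hedged continuation (not part of the original card), one way to run the pre-trained model end to end is to call `generate` on the encoding; `max_length` here is an illustrative choice:

```python
# Hedged continuation of the snippet above (not from the original card).
# If the checkpoint's config lacks a decoder_start_token_id, it may need to
# be passed to generate() explicitly.
output_ids = model.generate(**encoding, max_length=32)  # max_length is illustrative
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Without fine-tuning, BART's denoising pre-training means the output is
# expected to roughly reconstruct the input sentence.
```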
You can fine-tune this model on downstream tasks.
✨ Features
This is a Japanese BART base model pre-trained on Japanese Wikipedia.
📦 Installation
No model-specific installation steps are required beyond the usual Transformers setup (e.g. `pip install transformers sentencepiece`); Juman++ must be installed separately for word segmentation (see Tokenization below).
💻 Usage Examples
Basic Usage
The basic usage is the same as the Quick Start example above: load the tokenizer and model with `AutoTokenizer` and `MBartForConditionalGeneration`, pass a Juman++-segmented sentence to the tokenizer, and feed the resulting encoding to the model.
Advanced Usage
You can fine-tune this model on downstream tasks; fine-tuning code is not included with this model card. A minimal sketch of what one fine-tuning step could look like is given below.
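Since no fine-tuning code is provided, the following is only a minimal sketch of one possible supervised seq2seq fine-tuning step using the Transformers and PyTorch APIs; the source/target pair and the learning rate are illustrative placeholders, and real data must be segmented into words with Juman++ first.

```python
# Minimal fine-tuning sketch (not part of the original model card).
# Source/target pairs are illustrative placeholders; real inputs must be
# segmented into words with Juman++ beforehand.
import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/bart-base-japanese')
model = MBartForConditionalGeneration.from_pretrained('ku-nlp/bart-base-japanese')

source = '京都 大学 で 自然 言語 処理 を 専攻 する 。'  # placeholder source
target = '京都 大学 で 自然 言語 処理 を 学ぶ 。'      # placeholder target

batch = tokenizer(source, text_target=target, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate

model.train()
outputs = model(**batch)          # labels from text_target give a cross-entropy loss
outputs.loss.backward()
optimizer.step()
```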
📚 Documentation
Tokenization
The input text should be segmented into words by Juman++ in advance; Juman++ 2.0.0-rc3 was used for pre-training. Each word is then tokenized into subwords by SentencePiece.
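The card does not show how to perform this segmentation from Python; one common option (an assumption here) is the pyknp wrapper around Juman++:

```python
# One possible pre-segmentation step (an assumption; the card only states that
# Juman++ must be used). Requires a Juman++ binary and `pip install pyknp`.
from pyknp import Juman

jumanpp = Juman()  # by default pyknp launches the `jumanpp` command found on PATH
raw = '京都大学で自然言語処理を専攻する。'
words = [m.midasi for m in jumanpp.analysis(raw).mrph_list()]
segmented = ' '.join(words)
# segmented should look like: '京都 大学 で 自然 言語 処理 を 専攻 する 。'
```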
Training data
We used the following corpora for pre-training:
- Japanese Wikipedia (18M sentences)
Training procedure
We first segmented the texts in the corpora into words using Juman++.
Then, we built a SentencePiece model with 32,000 tokens, consisting of words from JumanDIC and subwords induced by SentencePiece's unigram language model.
We tokenized the segmented corpora into subwords with this SentencePiece model and trained the Japanese BART model using the fairseq library.
The training took 2 weeks using 4 Tesla V100 GPUs.
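The pre-training scripts themselves are not included in this card; the following is only a hedged illustration of the SentencePiece step described above, with placeholder file names and a placeholder JumanDIC word list:

```python
# Hedged illustration of the subword-model step (paths and the JumanDIC word
# list are placeholders; the actual pre-training scripts are not included here).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='wikipedia_segmented.txt',        # Juman++-segmented corpus, one sentence per line
    model_prefix='japanese_bart_sp',
    vocab_size=32000,                       # 32,000 tokens as described above
    model_type='unigram',                   # unigram language model of SentencePiece
    user_defined_symbols=['京都', '大学'],   # placeholder for words taken from JumanDIC
)
sp = spm.SentencePieceProcessor(model_file='japanese_bart_sp.model')
```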
The following hyperparameters were used during pre-training:
- distributed_type: multi-GPU
- num_devices: 4
- batch_size: 512
- training_steps: 500,000
- encoder layers: 6
- decoder layers: 6
- hidden size: 768
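For reference, a hedged sketch of how these architecture numbers map onto a Transformers `MBartConfig` (illustrative only; the released checkpoint already ships its own configuration):

```python
# Illustrative only: the architecture hyperparameters above expressed as a
# Transformers MBartConfig. The released checkpoint ships its own config.
from transformers import MBartConfig

config = MBartConfig(
    vocab_size=32000,     # SentencePiece vocabulary described above
    encoder_layers=6,
    decoder_layers=6,
    d_model=768,          # hidden size
)
print(config.encoder_layers, config.decoder_layers, config.d_model)
```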
🔧 Technical Details
The model is pre-trained on Japanese Wikipedia. Input text must be pre-processed with Juman++ for word segmentation and then tokenized into subwords with SentencePiece. Pre-training consisted of building a 32,000-token SentencePiece model and training with the fairseq library under the hyperparameters listed above, on 4 Tesla V100 GPUs for 2 weeks.
📄 License
The model is licensed under CC BY-SA 4.0.