🚀 bart-base-cantonese
This is the Cantonese model of BART base. It was obtained by second-stage pre-training on the LIHKG dataset, starting from the fnlp/bart-base-chinese model, and offers a base model for Cantonese natural language processing tasks.
🚀 Quick Start
This project is supported by Cloud TPUs from Google's TPU Research Cloud (TRC).
⚠️ Important Note
To avoid any copyright issues, please do not use this model for any purpose.
✨ Features
This model is specifically designed for Cantonese language processing. It leverages second-stage pre-training on a large Cantonese dataset (LIHKG), which improves its performance on Cantonese tasks compared with the original Chinese base model.
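To make the second-stage pre-training concrete, below is a minimal sketch of one denoising step, assuming a BART-style text-infilling objective starting from fnlp/bart-base-chinese. The example sentence, the masking, and the hyperparameters are illustrative only and are not taken from the actual training run.

```python
from transformers import BertTokenizer, BartForConditionalGeneration

# Start from the Chinese checkpoint that this model continued pre-training from.
tokenizer = BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
model = BartForConditionalGeneration.from_pretrained('fnlp/bart-base-chinese')

original = '聽日就要返香港'     # target: the clean sentence ("Tomorrow I have to go back to Hong Kong")
corrupted = '聽日就要返[MASK]'  # input: a span replaced by [MASK]

enc = tokenizer(corrupted, return_tensors='pt',
                padding='max_length', max_length=64, truncation=True)
labels = tokenizer(original, return_tensors='pt',
                   padding='max_length', max_length=64,
                   truncation=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

# Denoising reconstruction loss: the decoder learns to restore the original text.
loss = model(input_ids=enc.input_ids,
             attention_mask=enc.attention_mask,
             labels=labels).loss
loss.backward()  # a real training run would follow with an optimizer step
```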
📦 Installation
No dedicated installation steps are required: the model is loaded through the Hugging Face transformers library, so a standard installation (with a backend such as PyTorch) is sufficient, for example:
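```bash
pip install transformers torch
```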
💻 Usage Examples
Basic Usage
```python
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

# Note: the model uses a BERT-style vocabulary, so load BertTokenizer, not BartTokenizer.
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)

# "Tomorrow I have to go back to Hong Kong, I'm so excited I can't [MASK]"
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))  # remove the spaces inserted between tokens
```
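For reference, the model is expected to fill the mask with 瞓, producing 聽日就要返香港,我激動到瞓唔着 ("I'm so excited I can't sleep").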
⚠️ Important Note
Please use `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
🔧 Technical Details
- Optimiser: SGD with learning rate 0.03, plus adaptive gradient clipping at 0.1 (see the sketch below)
- Dataset: 172,937,863 sentences, each padded or truncated to 64 tokens
- Batch size: 640
- Number of epochs: 7 epochs + 61,440 extra steps
- Time: 44.0 hours on Google Cloud TPU v4-16
WandB link: 1j7zs802
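The adaptive gradient clipping mentioned above rescales each gradient by the norm of the parameter it belongs to, rather than by a fixed global norm. The following is a simplified, per-tensor PyTorch sketch of the idea for illustration; it is not the actual training code, and the real run may have used a unit-wise variant.

```python
import torch

@torch.no_grad()
def adaptive_gradient_clipping(parameters, clip_factor=0.1, eps=1e-3):
    """Per-tensor adaptive gradient clipping (illustrative sketch).

    Rescales each gradient so that ||grad|| <= clip_factor * ||param||.
    The eps floor keeps near-zero parameters from zeroing out their gradients.
    """
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp(min=eps)
        grad_norm = p.grad.detach().norm()
        max_norm = clip_factor * param_norm
        if grad_norm > max_norm:
            p.grad.mul_(max_norm / (grad_norm + 1e-6))

# Hypothetical usage in a training step with SGD at learning rate 0.03:
#   loss.backward()
#   adaptive_gradient_clipping(model.parameters(), clip_factor=0.1)
#   optimizer.step()
```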
📄 License
The license for this project is listed as "other".
| Property | Details |
|----------|---------|
| Language | Cantonese |
| Tags | cantonese |
| Library Name | transformers |
| CO2 Eq Emissions | Emissions: 6.29; Source: estimated using the ML CO2 Calculator; Training Type: second-stage pre-training; Hardware Used: Google Cloud TPU v4-16 |
| Pipeline Tag | fill-mask |