🚀 bart-base-cantonese
This is the Cantonese model of BART base. It was obtained by second-stage pre-training on the LIHKG dataset, starting from the fnlp/bart-base-chinese model, and offers a base model for Cantonese natural language processing tasks.
🚀 Quick Start
This project is supported by Cloud TPUs from Google's TPU Research Cloud (TRC).
⚠️ Important Note
To avoid any copyright issues, please do not use this model for any purpose.
✨ Features
This model is specifically designed for Cantonese language processing. It leverages second-stage pre-training on a large Cantonese dataset (LIHKG), which improves its performance on Cantonese tasks compared with the original Chinese base model.
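To make the second-stage pre-training concrete, below is a minimal sketch of one denoising step, assuming a BART-style text-infilling objective starting from fnlp/bart-base-chinese. The example sentence, the masking, and the hyperparameters are illustrative only and are not taken from the actual training run.

```python
from transformers import BertTokenizer, BartForConditionalGeneration

# Start from the Chinese checkpoint that this model continued pre-training from.
tokenizer = BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
model = BartForConditionalGeneration.from_pretrained('fnlp/bart-base-chinese')

original = '聽日就要返香港'     # target: the clean sentence ("Tomorrow I have to go back to Hong Kong")
corrupted = '聽日就要返[MASK]'  # input: a span replaced by [MASK]

enc = tokenizer(corrupted, return_tensors='pt',
                padding='max_length', max_length=64, truncation=True)
labels = tokenizer(original, return_tensors='pt',
                   padding='max_length', max_length=64,
                   truncation=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

# Denoising reconstruction loss: the decoder learns to restore the original text.
loss = model(input_ids=enc.input_ids,
             attention_mask=enc.attention_mask,
             labels=labels).loss
loss.backward()  # a real training run would follow with an optimizer step
```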
📦 Installation
No dedicated installation steps are required: the model is loaded through the Hugging Face transformers library, so a standard installation (with a backend such as PyTorch) is sufficient, for example:
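```bash
pip install transformers torch
```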
💻 Usage Examples
Basic Usage
```python
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

# Note: the model uses a BERT-style vocabulary, so load BertTokenizer, not BartTokenizer.
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)

# "Tomorrow I have to go back to Hong Kong, I'm so excited I can't [MASK]"
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))  # remove the spaces inserted between tokens
```
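For reference, the model is expected to fill the mask with 瞓, producing 聽日就要返香港,我激動到瞓唔着 ("I'm so excited I can't sleep").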
⚠️ Important Note
Please use `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
🔧 Technical Details
- Optimiser: SGD with learning rate 0.03, plus adaptive gradient clipping at 0.1 (see the sketch below)
- Dataset: 172,937,863 sentences, each padded or truncated to 64 tokens
- Batch size: 640
- Number of epochs: 7 epochs + 61,440 extra steps
- Time: 44.0 hours on Google Cloud TPU v4-16
WandB link: 1j7zs802
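The adaptive gradient clipping mentioned above rescales each gradient by the norm of the parameter it belongs to, rather than by a fixed global norm. The following is a simplified, per-tensor PyTorch sketch of the idea for illustration; it is not the actual training code, and the real run may have used a unit-wise variant.

```python
import torch

@torch.no_grad()
def adaptive_gradient_clipping(parameters, clip_factor=0.1, eps=1e-3):
    """Per-tensor adaptive gradient clipping (illustrative sketch).

    Rescales each gradient so that ||grad|| <= clip_factor * ||param||.
    The eps floor keeps near-zero parameters from zeroing out their gradients.
    """
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp(min=eps)
        grad_norm = p.grad.detach().norm()
        max_norm = clip_factor * param_norm
        if grad_norm > max_norm:
            p.grad.mul_(max_norm / (grad_norm + 1e-6))

# Hypothetical usage in a training step with SGD at learning rate 0.03:
#   loss.backward()
#   adaptive_gradient_clipping(model.parameters(), clip_factor=0.1)
#   optimizer.step()
```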
📄 License
The license for this project is listed as "other".
| Property | Details |
|----------|---------|
| Language | Cantonese |
| Tags | cantonese |
| Library Name | transformers |
| CO2 Eq Emissions | Emissions: 6.29; Source: estimated using the ML CO2 Calculator; Training Type: second-stage pre-training; Hardware Used: Google Cloud TPU v4-16 |
| Pipeline Tag | fill-mask |