🚀 MacBERT for Chinese Spelling Correction (macbert4csc) Model
A model for Chinese spelling correction.
Evaluated on the SIGHAN2015 test set, the macbert4csc-base-chinese model achieves the following results:
| | Correct-Precision | Correct-Recall | Correct-F1 |
| --- | --- | --- | --- |
| Character-level | 93.72 | 86.40 | 89.91 |
| Sentence-level | 82.64 | 73.66 | 77.89 |
Because the model is trained on the SIGHAN2015 training set (to reproduce the paper), it reaches state-of-the-art (SOTA) performance on the SIGHAN2015 test set.
The model architecture is adapted from Soft-Masked BERT.

🚀 Quick Start
✨ Features
- High accuracy in Chinese spelling correction.
- Based on the MacBERT architecture with its novel "MLM as correction" pre-training task.
📦 Installation
This model is part of the Chinese text correction project pycorrector. Install the relevant dependencies according to that project's requirements.
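For example, pycorrector is published on PyPI, so a minimal setup typically looks like this (pin versions per the project's requirements if needed):

```bash
pip install -U pycorrector
```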
💻 Usage Examples
Basic Usage
This model is open-sourced in the Chinese text correction project pycorrector, which supports the macbert4csc model. You can call it as follows:
```python
from pycorrector.macbert.macbert_corrector import MacBertCorrector

# Load the model from the Hugging Face Hub
m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
i = m.correct('今天新情很好')
print(i)
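```

For the input above, `correct` should return the corrected sentence 今天心情很好; the exact return format (text only, or text plus edit details) depends on the pycorrector version. Compare the sample output in the advanced example below.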
Advanced Usage
Alternatively, you can call the model directly through transformers:
```python
import operator

import torch
from transformers import BertTokenizer, BertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作,我也很高心。"]
with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, return_tensors='pt').to(device))


def get_errors(corrected_text, origin_text):
    """Align the decoded text with the original and collect (wrong, right, start, end) edits."""
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # The tokenizer drops these characters; re-insert the original one
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # The tokenizer lowercases English letters; restore the original case
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details


result = []
for ids, text in zip(outputs.logits, texts):
    # Greedy decoding: take the argmax token at every position
    _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
    corrected_text = _text[:len(text)]
    corrected_text, details = get_errors(corrected_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)
```
Output:

```
今天新情很好 => 今天心情很好 [('新', '心', 2, 3)]
你找到你最喜欢的工作,我也很高心。 => 你找到你最喜欢的工作,我也很高兴。 [('心', '兴', 15, 16)]
```
The model files are organized as follows:

```
macbert4csc-base-chinese
├── config.json
├── added_tokens.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```
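These are the standard files that `from_pretrained` expects. If you prefer to download them once and load from disk, a minimal sketch using huggingface_hub (assuming it is installed; `snapshot_download` fetches the whole repository) could look like this:

```python
from huggingface_hub import snapshot_download
from transformers import BertTokenizer, BertForMaskedLM

# Download all model files listed above into the local Hugging Face cache
local_dir = snapshot_download(repo_id="shibing624/macbert4csc-base-chinese")

# Load tokenizer and model from the local copy
tokenizer = BertTokenizer.from_pretrained(local_dir)
model = BertForMaskedLM.from_pretrained(local_dir)
```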
📚 Documentation
Training Datasets
SIGHAN + Wang271K Chinese Correction Dataset
The data format of the SIGHAN + Wang271K Chinese correction dataset is as follows:
```json
[
  {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
  }
]
```
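To make the format concrete, here is a minimal sketch that reads such a file and maps each index in `wrong_ids` to its wrong/correct character pair (`train.json` is a hypothetical path; `wrong_ids` are 0-based character offsets into `original_text`):

```python
import json

# Load SIGHAN + Wang271K style samples (train.json is a hypothetical path)
with open("train.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    original = sample["original_text"]
    correct = sample["correct_text"]
    # Each wrong_id is a 0-based character offset into original_text
    for idx in sample["wrong_ids"]:
        print(f'{sample["id"]}: position {idx}: {original[idx]} -> {correct[idx]}')
        # e.g. B2-4029-3: position 5: 嗓 -> 噪
```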
The trained macbert4csc model directory has the following layout:

```
macbert4csc
├── config.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```
If you need to train macbert4csc, please refer to https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert
🔧 Technical Details
MacBERT is an improved BERT with a novel MLM as correction pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.
Here is an example of our pre-training task.
| Task | Example |
| --- | --- |
| Original Sentence | we use a language model to predict the probability of the next word. |
| MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
| Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
| N-gram masking | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
| MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
In addition to the new pre-training task, we also incorporate the following techniques (a toy sketch contrasting the masking strategies follows the list):
- Whole Word Masking (WWM)
- N-gram masking
- Sentence-Order Prediction (SOP)
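For illustration only, here is a minimal Python sketch (not the authors' implementation) contrasting plain [M] masking with the MLM-as-correction substitution; the similar-word table is a made-up stand-in for the similar-word lookup used in practice:

```python
# Toy similar-word table; a real system derives similar words from embeddings
SIMILAR = {"model": "system", "predict": "calculate", "probability": "possibility"}

def mlm_mask(tokens, positions):
    # Standard MLM: replace selected tokens with the artificial [M] token
    return [("[M]" if i in positions else t) for i, t in enumerate(tokens)]

def mac_mask(tokens, positions):
    # MLM as correction: replace selected tokens with similar words,
    # so the pre-training input never contains [M]
    return [(SIMILAR.get(t, t) if i in positions else t) for i, t in enumerate(tokens)]

tokens = "we use a language model to predict the probability of the next word .".split()
positions = {4, 6, 8}  # indices of "model", "predict", "probability"

print(" ".join(mlm_mask(tokens, positions)))
# we use a language [M] to [M] the [M] of the next word .
print(" ".join(mac_mask(tokens, positions)))
# we use a language system to calculate the possibility of the next word .
```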
Note that our MacBERT is a drop-in replacement for the original BERT, as there is no difference in the main neural architecture.
For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing
📄 License
This project is licensed under the Apache-2.0 license.
Citation
```bibtex
@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Text Error Correction Tool},
  year = {2021},
  url = {https://github.com/shibing624/pycorrector},
}
```