🚀 macbert4csc_v2
macbert4csc_v2 is a Chinese spelling correction model built on the macbert4csc architecture, with an additional error-detection branch used during training and dropped at inference. It can be called through the transformers library or through the macro-correct project.
🚀 Quick Start
macbert4csc_v2 is mainly used for Chinese Spelling Correction (CSC), i.e. text correction, and for CSC evaluation. The project address is https://github.com/yongzhuo/macro-correct.
Model Features
- The released weights are macbert4csc_v2, built on the macbert4csc architecture (pycorrector version): an extra branch is attached after BertForMaskedLM for the error-detection task (a token-classification task, non-interactive with the correction branch).
- During training, MFT is applied (dynamically masking 20% of the non-error tokens), and det_loss is weighted by 0.3; see the sketch after this list.
- During inference, the detection layer (det-layer) after MacBERT is discarded; only the MLM correction head is used.
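The following is a minimal PyTorch sketch of this training setup, assuming a standard BertForMaskedLM backbone; the class, function, and variable names (MacBert4CSC, mft_mask, det_head, det_labels) are hypothetical illustrations, not the project's actual code:

```python
import torch
import torch.nn as nn
from transformers import BertForMaskedLM


def mft_mask(input_ids, det_labels, mask_token_id, ratio=0.2, num_special=104):
    """MFT (sketch): randomly replace `ratio` of the NON-error, non-special tokens with [MASK]."""
    # Crude special-token cut-off assuming a bert-base-chinese style vocabulary.
    non_error = (det_labels == 0) & (input_ids >= num_special)
    chosen = non_error & (torch.rand_like(input_ids, dtype=torch.float) < ratio)
    return torch.where(chosen, torch.full_like(input_ids, mask_token_id), input_ids)


class MacBert4CSC(nn.Module):
    """MLM correction head plus a non-interactive per-token detection branch (sketch)."""

    def __init__(self, pretrained="hfl/chinese-macbert-base", det_weight=0.3):
        super().__init__()
        self.bert_mlm = BertForMaskedLM.from_pretrained(pretrained)
        self.det_head = nn.Linear(self.bert_mlm.config.hidden_size, 1)  # binary error/non-error
        self.det_weight = det_weight  # det_loss weight of 0.3, as stated above

    def forward(self, input_ids, attention_mask, labels, det_labels):
        out = self.bert_mlm(input_ids=input_ids, attention_mask=attention_mask,
                            labels=labels, output_hidden_states=True)
        cor_loss = out.loss  # MLM cross-entropy against the corrected text
        det_logits = self.det_head(out.hidden_states[-1]).squeeze(-1)
        det_loss = nn.functional.binary_cross_entropy_with_logits(
            det_logits, det_labels.float(), weight=attention_mask.float())
        # At inference time the det_head is simply dropped and only bert_mlm is kept.
        return cor_loss + self.det_weight * det_loss
```

Because only the MLM part is kept for inference, the released checkpoint can be loaded directly with BertForMaskedLM, as shown in the Usage Examples below.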
Usage
- Call using transformers.
- Call using the macro-correct project. For details, see the Usage Examples section below.
✨ Features
- Accurate Chinese Spelling Correction: it effectively corrects common typos as well as spelling errors from multiple domains.
- Detection-Augmented Architecture: the macbert4csc architecture adds an error-detection branch during training, improving detection accuracy.
- Flexible Usage: the model can be called through the transformers library or through the macro-correct project.
📦 Installation
Specific installation steps are not provided here; see the macro-correct project page (https://github.com/yongzhuo/macro-correct) for setup instructions.
💻 Usage Examples
Basic Usage
Using macro-correct
```python
import os

os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct

### Default correction (list input)
text_list = [
    "真麻烦你了。希望你们好好的跳无",
    "少先队员因该为老人让坐",
    "机七学习是人工智能领遇最能体现智能的一个分知",
    "一只小鱼船浮在平净的河面上",
]
text_csc = correct(text_list)
print("Default correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
"""
Default correction (list input):
{'index': 0, 'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好地跳舞', 'errors': [['的', '地', 12, 0.6584], ['无', '舞', 14, 1.0]]}
{'index': 1, 'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让坐', 'errors': [['因', '应', 4, 0.995]]}
{'index': 2, 'source': '机七学习是人工智能领遇最能体现智能的一个分知', 'target': '机器学习是人工智能领域最能体现智能的一个分支', 'errors': [['七', '器', 1, 0.9998], ['遇', '域', 10, 0.9999], ['知', '支', 21, 1.0]]}
{'index': 3, 'source': '一只小鱼船浮在平净的河面上', 'target': '一只小鱼船浮在平静的河面上', 'errors': [['净', '静', 8, 0.9961]]}
"""
```
Using transformers
```python
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
# @author  : Mo
# @function: load a bert-type CSC model directly with transformers for testing
import traceback
import time
import sys
import os

os.environ["USE_TORCH"] = "1"
from transformers import BertConfig, BertTokenizer, BertForMaskedLM
import torch

# pretrained_model_name_or_path = "shibing624/macbert4csc-base-chinese"
# pretrained_model_name_or_path = "Macropodus/macbert4mdcspell_v1"
# pretrained_model_name_or_path = "Macropodus/macbert4csc_v1"
pretrained_model_name_or_path = "Macropodus/macbert4csc_v2"
# pretrained_model_name_or_path = "Macropodus/bert4csc_v1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_len = 128

print("load model, please wait a few minutes!")
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
bert_config = BertConfig.from_pretrained(pretrained_model_name_or_path)
model = BertForMaskedLM.from_pretrained(pretrained_model_name_or_path)
model.to(device)
print("load model success!")

texts = [
    "机七学习是人工智能领遇最能体现智能的一个分知",
    "我是练习时长两念半的鸽仁练习生蔡徐坤",
    "真麻烦你了。希望你们好好的跳无",
    "他法语说的很好,的语也不错",
    "遇到一位很棒的奴生跟我疗天",
    "我们为这个目标努力不解",
]
# +2 accounts for the [CLS] and [SEP] tokens added by the tokenizer.
len_mid = min(max_len, max([len(t) + 2 for t in texts]))

with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, max_length=len_mid,
                                return_tensors="pt").to(device))


def get_errors(source, target):
    """Minimal helper: collect [wrong char, corrected char, position] triples."""
    len_min = min(len(source), len(target))
    errors = []
    for idx in range(len_min):
        if source[idx] != target[idx]:
            errors.append([source[idx], target[idx], idx])
    return errors


result = []
for probs, source in zip(outputs.logits, texts):
    # Greedy decoding: take the most likely token at every position.
    ids = torch.argmax(probs, dim=-1)
    tokens_space = tokenizer.decode(ids[1:-1], skip_special_tokens=False)
    text_new = tokens_space.replace(" ", "")
    # Truncate to the source length to drop trailing [SEP]/[PAD] tokens.
    target = text_new[:len(source)]
    errors = get_errors(source, target)
    print(source, " => ", target, errors)
    result.append([target, errors])
print(result)
"""
机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分支 [['七', '器', 1], ['遇', '域', 10], ['知', '支', 21]]
我是练习时长两念半的鸽仁练习生蔡徐坤 => 我是练习时长两年半的个人练习生蔡徐坤 [['念', '年', 7], ['鸽', '个', 10], ['仁', '人', 11]]
真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好地跳舞 [['的', '地', 12], ['无', '舞', 14]]
他法语说的很好,的语也不错 => 他法语说得很好,德语也不错 [['的', '得', 4], ['的', '德', 8]]
遇到一位很棒的奴生跟我疗天 => 遇到一位很棒的女生跟我聊天 [['奴', '女', 7], ['疗', '聊', 11]]
我们为这个目标努力不解 => 我们为这个目标努力不懈 [['解', '懈', 10]]
"""
```
📚 Documentation
I. Test
1.1 Test Data Source
The evaluation data is available at Macropodus/csc_eval_public. All training data comes from the public web or open-source datasets; there are roughly 10 million training samples, and the confusion dictionary used for data generation is fairly large.
1. gen_de3.json (5545): correction of '的地得', manually generated from high-quality data such as People's Daily, Learning Power (xuexiqiangguo), and chinese-poetry.
2. lemon_v2.tet.json (1053): data released with the ReLM paper; a multi-domain spelling-correction dataset covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW).
3. acc_rmrb.tet.json (4636): from NER-199801 (a high-quality People's Daily corpus).
4. acc_xxqg.tet.json (5000): from the high-quality corpus of the Learning Power website.
5. gen_passage.tet.json (10000): source sentences are good words and sentences generated by qwen; errors were generated with a confusion dictionary summarized from almost all open-source data.
6. textproof.tet.json (1447): NLP competition data from the Text Proofreading Competition.
7. gen_xxqg.tet.json (5000): source sentences from the high-quality Learning Power corpus; errors were generated with the same confusion dictionary.
8. faspell.dev.json (1000): video-subtitle OCR data from iQIYI's FASPell paper.
9. lomo_tet.json (5000): mainly near-homophone spelling errors, from Tencent's manually annotated CSCD-NS dataset.
10. mcsc_tet.5000.json (5000): medical spelling correction from real historical logs of the Tencent Yidian app; note that the paper states this dataset focuses only on correcting medical entities, not common characters.
11. ecspell.dev.json (1500): from the ECSpell paper, covering three domains (law/med/gov).
12. sighan2013.dev.json (1000): from the SIGHAN 2013 bake-off.
13. sighan2014.dev.json (1062): from the SIGHAN 2014 bake-off.
14. sighan2015.dev.json (1100): from the SIGHAN 2015 bake-off.
1.2 Test Data Preprocessing
The test data has been normalized with operations such as full-width to half-width conversion, simplified/traditional character conversion, and punctuation normalization (a sketch of the first operation follows).
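As an illustration of full-width to half-width conversion, a minimal version might look like the sketch below (this is not the project's actual preprocessing code):

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII-range characters to their half-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:     # full-width '!' .. '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


print(to_halfwidth("ＡＢＣ１２３"))  # -> "ABC123"
```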
1.3 Other Notes
1. Metrics prefixed with 'common' are very lenient, matching the evaluation of the open-source project pycorrector (a sentence-level sketch follows this list).
2. Metrics prefixed with 'strict' are very strict, matching the open-source project [wangwang110/CSC](https://github.com/wangwang110/CSC).
3. The macbert4mdcspell_v1 model is trained with the MDCSpell architecture plus BERT's MLM loss, but only the BERT-MLM part is used during inference.
4. The acc_rmrb/acc_xxqg datasets contain no errors and are used to evaluate the model's over-correction rate.
5. qwen25_1-5b_pycorrector refers to shibing624/chinese-text-correction-1.5b; its training data includes the validation and test sets of lemon_v2/mcsc_tet/ecspell, whereas the BERT-type models were not trained on any validation or test sets.
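As a rough illustration of what a lenient, sentence-level correction metric looks like, here is a generic formulation for intuition only (not the exact script used to produce the tables below):

```python
def sentence_correction_prf(sources, targets, predictions):
    """Lenient sentence-level correction precision/recall/F1 (illustrative)."""
    tp = fp = fn = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        if src != tgt:            # the sentence contains at least one error
            if pred == tgt:
                tp += 1           # fully corrected
            else:
                fn += 1           # missed or only partially corrected
        elif pred != src:
            fp += 1               # error-free sentence was changed (over-correction)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```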
II. Important Indicators
2.1 F1(common_cor_f1)
model/common_cor_f1 | avg | gen_de3 | lemon_v2 | gen_passage | text_proof | gen_xxqg | faspell | lomo_tet | mcsc_tet | ecspell | sighan2013 | sighan2014 | sighan2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
macbert4csc_pycorrector | 45.8 | 42.44 | 42.89 | 31.49 | 46.31 | 26.06 | 32.7 | 44.83 | 27.93 | 55.51 | 70.89 | 61.72 | 66.81 |
bert4csc_v1 | 62.28 | 93.73 | 61.99 | 44.79 | 68.0 | 35.03 | 48.28 | 61.8 | 64.41 | 79.11 | 77.66 | 51.01 | 61.54 |
macbert4csc_v1 | 68.55 | 96.67 | 65.63 | 48.4 | 75.65 | 38.43 | 51.76 | 70.11 | 80.63 | 85.55 | 81.38 | 57.63 | 70.7 |
macbert4csc_v2 | 68.6 | 96.74 | 66.02 | 48.26 | 75.78 | 38.84 | 51.91 | 70.17 | 80.71 | 85.61 | 80.97 | 58.22 | 69.95 |
macbert4mdcspell_v1 | 71.1 | 96.42 | 70.06 | 52.55 | 79.61 | 43.37 | 53.85 | 70.9 | 82.38 | 87.46 | 84.2 | 61.08 | 71.32 |
qwen25_1-5b_pycorrector | 45.11 | 27.29 | 89.48 | 14.61 | 83.9 | 13.84 | 18.2 | 36.71 | 96.29 | 88.2 | 36.41 | 15.64 | 20.73 |
2.2 acc(common_cor_acc)
model/common_cor_acc | avg | gen_de3 | lemon_v2 | gen_passage | text_proof | gen_xxqg | faspell | lomo_tet | mcsc_tet | ecspell | sighan2013 | sighan2014 | sighan2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
macbert4csc_pycorrector | 48.26 | 26.96 | 28.68 | 34.16 | 55.29 | 28.38 | 22.2 | 60.96 | 57.16 | 67.73 | 55.9 | 68.93 | 72.73 |
bert4csc_v1 | 60.76 | 88.21 | 45.96 | 43.13 | 68.97 | 35.0 | 34.0 | 65.86 | 73.26 | 81.8 | 64.5 | 61.11 | 67.27 |
macbert4csc_v1 | 65.34 | 93.56 | 49.76 | 44.98 | 74.64 | 36.1 | 37.0 | 73.0 | 83.6 | 86.87 | 69.2 | 62.62 | 72.73 |
macbert4csc_v2 | 65.22 | 93.69 | 50.14 | 44.92 | 74.64 | 36.26 | 37.0 | 72.72 | 83.66 | 86.93 | 68.5 | 62.43 | 71.73 |
macbert4mdcspell_v1 | 67.15 | 93.09 | 54.8 | 47.71 | 78.09 | 39.52 | 38.8 | 71.92 | 84.78 | 88.27 | 73.2 | 63.28 | 72.36 |
qwen25_1-5b_pycorrector | 46.09 | 15.82 | 81.29 | 22.96 | 82.17 | 19.04 | 12.8 | 50.2 | 96.4 | 89.13 | 22.8 | 27.87 | 32.55 |
2.3 acc (acc_true, thr=0.75)
model/acc | avg | acc_rmrb | acc_xxqg |
---|---|---|---|
macbert4csc_pycorrector | 99.24 | 99.22 | 99.26 |
bert4csc_v1 | 98.71 | 98.36 | 99.06 |
macbert4csc_v1 | 97.72 | 96.72 | 98.72 |
macbert4csc_v2 | 97.89 | 96.98 | 98.8 |
macbert4mdcspell_v1 | 97.75 | 96.51 | 98.98 |
qwen25_1-5b_pycorrector | 82.0 | 77.14 | 86.86 |
III. Conclusion
1. macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 are trained on data from multiple domains, which makes them relatively balanced; they are suitable as first-step pre-trained models that can be further fine-tuned on proprietary domain data.
2. Comparing macbert4csc_pycorrector/bert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 against Table 2.3 shows that more training data brings better correction performance, but also a slightly higher over-correction rate.
3. MFT (Mask-Correct) is still effective, but the gain is small once the data volume is sufficient; it may also be an important reason for the increased over-correction rate.
4. The training data also contains classical Chinese, so the trained model supports correcting classical Chinese as well.
5. The trained model recognizes and corrects high-frequency errors such as '地得的' with a high success rate.
🔧 Technical Details
The model uses the macbert4csc architecture with an additional error-detection branch. During training, MFT dynamically masks 20% of the non-error tokens and det_loss is weighted by 0.3; during inference, the detection layer is discarded and only the MLM correction head is used. The test data comes from multiple sources and has been normalized, and the evaluation metrics are reported in both a lenient ('common') and a strict ('strict') form.
📄 License
The model is licensed under the Apache-2.0 license.
📚 Papers
- 2024-Refining: Refining Corpora from a Model Calibration Perspective for Chinese
- 2024-ReLM: Chinese Spelling Correction as Rephrasing Language Model
- 2024-DISC: DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check
- 2023-Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check
- 2023-BERT-MFT: Rethinking Masked Language Modeling for Chinese Spelling Correction
- 2023-PTCSpell: PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction
- 2023-DR-CSC: [A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese](https://aclanthology.org/2023.findings-emnlp.771)
- 2023-DROM: Disentangled Phonetic Representation for Chinese Spelling Correction
- 2023-EGCM: An Error-Guided Correction Model for Chinese Spelling Error Correction
- 2023-IGPI: Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What’s Next?
- 2023-CL: Contextual Similarity is More Valuable than Character Similarity - An Empirical Study for Chinese Spell Checking
- 2022-CRASpell: [CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.237)
- 2022-MDCSpell: [MDCSpell: A Multi-task Detector-Corrector Framework for Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.98)
- 2022-SCOPE: Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity
- 2022-ECOPO: The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
- 2021-MLMPhonetics: [Correcting Chinese Spelling Errors with Phonetic Pre-training](https://aclanthology.org/2021.findings-acl.198)
- 2021-ChineseBERT: [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://aclanthology.org/2021.acl-long.161/)
- 2021-BERTCrsGad: [Global Attention Decoder for Chinese Spelling Error Correction](https://aclanthology.org/2021.findings-acl.122)
- 2021-ThinkTwice: [Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction](https://www.mdpi.com/2076-3417/11/13/5832)
- 2021-PHMOSpell: [PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check](https://aclanthology.org/2021.acl-long.464)
- 2021-SpellBERT: [SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check](https://aclanthology.org/2021.emnlp-main.287)
- 2021-TwoWays: [Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models](https://aclanthology.org/2021.acl-short.56)
- 2021-ReaLiSe: Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking
- 2021-DCSpell: DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction
- 2021-PLOME: [PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction](https://aclanthology.org/2021.acl-long.233)
- 2021-DCN: [Dynamic Connected Networks for Chinese Spelling Check](https://aclanthology.org/2021.findings-acl.216/)
- 2020-SoftMaskBERT: Spelling Error Correction with Soft-Masked BERT
- 2020-SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
- 2020-ChunkCSC: [Chunk-based Chinese Spelling Check with Global Optimization](https://aclanthology.org/2020.findings-emnlp.184)
- 2020-MacBERT: Revisiting Pre-Trained Models for Chinese Natural Language Processing
- 2019-FASPell: [FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm](https://aclanthology.org/D19-5522)
- 2018-Hybrid: [A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Checking](https://aclanthology.org/D18-1273)
- 2015-Sighan15: [Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check](https://aclanthology.org/W15-3106/)
- 2014-Sighan14: [Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check](https://aclanthology.org/W14-6820/)
- 2013-Sighan13: [Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013](https://aclanthology.org/W13-4406/)
📚 References
- [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
- destwang/CTCResources
- wangwang110/CSC
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- garychowcmu/daizhigev20
- yangjianxin1/Firefly
- Macropodus/xuexiqiangguo_428w
- Macropodus/csc_clean_wang271k
- Macropodus/csc_eval_public
- shibing624/pycorrector
- iioSnail/MDCSpell_pytorch
- gingasan/lemon
- [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)
📚 Citation
To cite this work, you can refer to the GitHub project, for example with BibTeX:

```bibtex
@software{macro-correct,
  url = {https://github.com/yongzhuo/macro-correct},
  author = {Yongzhuo Mo},
  title = {macro-correct},
  year = {2025}
}
```

