🚀 macbert4csc_v2
macbert4csc_v2 is a Chinese spelling correction model built on the macbert4csc architecture, with an additional error-detection branch used during training and dropped at inference. It can be called through the transformers library or through the macro-correct project.
🚀 Quick Start
macbert4csc_v2 is mainly used for Chinese Spelling Correction (CSC), i.e. text correction, and for CSC evaluation. The project address is https://github.com/yongzhuo/macro-correct.
Model Features
- The released weights are macbert4csc_v2, built on the macbert4csc architecture (pycorrector version): an extra branch is attached after BertForMaskedLM for the error-detection task (a token-classification task, non-interactive with the correction branch).
- During training, MFT is applied (dynamically masking 20% of the non-error tokens), and det_loss is weighted by 0.3; see the sketch after this list.
- During inference, the detection layer (det-layer) after MacBERT is discarded; only the MLM correction head is used.
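The following is a minimal PyTorch sketch of this training setup, assuming a standard BertForMaskedLM backbone; the class, function, and variable names (MacBert4CSC, mft_mask, det_head, det_labels) are hypothetical illustrations, not the project's actual code:

```python
import torch
import torch.nn as nn
from transformers import BertForMaskedLM


def mft_mask(input_ids, det_labels, mask_token_id, ratio=0.2, num_special=104):
    """MFT (sketch): randomly replace `ratio` of the NON-error, non-special tokens with [MASK]."""
    # Crude special-token cut-off assuming a bert-base-chinese style vocabulary.
    non_error = (det_labels == 0) & (input_ids >= num_special)
    chosen = non_error & (torch.rand_like(input_ids, dtype=torch.float) < ratio)
    return torch.where(chosen, torch.full_like(input_ids, mask_token_id), input_ids)


class MacBert4CSC(nn.Module):
    """MLM correction head plus a non-interactive per-token detection branch (sketch)."""

    def __init__(self, pretrained="hfl/chinese-macbert-base", det_weight=0.3):
        super().__init__()
        self.bert_mlm = BertForMaskedLM.from_pretrained(pretrained)
        self.det_head = nn.Linear(self.bert_mlm.config.hidden_size, 1)  # binary error/non-error
        self.det_weight = det_weight  # det_loss weight of 0.3, as stated above

    def forward(self, input_ids, attention_mask, labels, det_labels):
        out = self.bert_mlm(input_ids=input_ids, attention_mask=attention_mask,
                            labels=labels, output_hidden_states=True)
        cor_loss = out.loss  # MLM cross-entropy against the corrected text
        det_logits = self.det_head(out.hidden_states[-1]).squeeze(-1)
        det_loss = nn.functional.binary_cross_entropy_with_logits(
            det_logits, det_labels.float(), weight=attention_mask.float())
        # At inference time the det_head is simply dropped and only bert_mlm is kept.
        return cor_loss + self.det_weight * det_loss
```

Because only the MLM part is kept for inference, the released checkpoint can be loaded directly with BertForMaskedLM, as shown in the Usage Examples below.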
Usage
- Call using transformers.
- Call using the macro-correct project. For details, see the Usage Examples section below.
✨ Features
- Accurate Chinese Spelling Correction: it effectively corrects common typos as well as spelling errors from multiple domains.
- Detection-Augmented Architecture: the macbert4csc architecture adds an error-detection branch during training, improving detection accuracy.
- Flexible Usage: the model can be called through the transformers library or through the macro-correct project.
📦 Installation
Specific installation steps are not provided here; see the macro-correct project page (https://github.com/yongzhuo/macro-correct) for setup instructions.
💻 Usage Examples
Basic Usage
Using macro-correct
```python
import os

os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"
from macro_correct import correct

### Default correction (list input)
text_list = [
    "真麻烦你了。希望你们好好的跳无",
    "少先队员因该为老人让坐",
    "机七学习是人工智能领遇最能体现智能的一个分知",
    "一只小鱼船浮在平净的河面上",
]
text_csc = correct(text_list)
print("Default correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)
"""
Default correction (list input):
{'index': 0, 'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好地跳舞', 'errors': [['的', '地', 12, 0.6584], ['无', '舞', 14, 1.0]]}
{'index': 1, 'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让坐', 'errors': [['因', '应', 4, 0.995]]}
{'index': 2, 'source': '机七学习是人工智能领遇最能体现智能的一个分知', 'target': '机器学习是人工智能领域最能体现智能的一个分支', 'errors': [['七', '器', 1, 0.9998], ['遇', '域', 10, 0.9999], ['知', '支', 21, 1.0]]}
{'index': 3, 'source': '一只小鱼船浮在平净的河面上', 'target': '一只小鱼船浮在平静的河面上', 'errors': [['净', '静', 8, 0.9961]]}
"""
```
Using transformers
```python
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
# @author  : Mo
# @function: load a bert-type CSC model directly with transformers for testing
import traceback
import time
import sys
import os

os.environ["USE_TORCH"] = "1"
from transformers import BertConfig, BertTokenizer, BertForMaskedLM
import torch

# pretrained_model_name_or_path = "shibing624/macbert4csc-base-chinese"
# pretrained_model_name_or_path = "Macropodus/macbert4mdcspell_v1"
# pretrained_model_name_or_path = "Macropodus/macbert4csc_v1"
pretrained_model_name_or_path = "Macropodus/macbert4csc_v2"
# pretrained_model_name_or_path = "Macropodus/bert4csc_v1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_len = 128

print("load model, please wait a few minutes!")
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
bert_config = BertConfig.from_pretrained(pretrained_model_name_or_path)
model = BertForMaskedLM.from_pretrained(pretrained_model_name_or_path)
model.to(device)
print("load model success!")

texts = [
    "机七学习是人工智能领遇最能体现智能的一个分知",
    "我是练习时长两念半的鸽仁练习生蔡徐坤",
    "真麻烦你了。希望你们好好的跳无",
    "他法语说的很好,的语也不错",
    "遇到一位很棒的奴生跟我疗天",
    "我们为这个目标努力不解",
]
# +2 accounts for the [CLS] and [SEP] tokens added by the tokenizer.
len_mid = min(max_len, max([len(t) + 2 for t in texts]))

with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, max_length=len_mid,
                                return_tensors="pt").to(device))


def get_errors(source, target):
    """Minimal helper: collect [wrong char, corrected char, position] triples."""
    len_min = min(len(source), len(target))
    errors = []
    for idx in range(len_min):
        if source[idx] != target[idx]:
            errors.append([source[idx], target[idx], idx])
    return errors


result = []
for probs, source in zip(outputs.logits, texts):
    # Greedy decoding: take the most likely token at every position.
    ids = torch.argmax(probs, dim=-1)
    tokens_space = tokenizer.decode(ids[1:-1], skip_special_tokens=False)
    text_new = tokens_space.replace(" ", "")
    # Truncate to the source length to drop trailing [SEP]/[PAD] tokens.
    target = text_new[:len(source)]
    errors = get_errors(source, target)
    print(source, " => ", target, errors)
    result.append([target, errors])
print(result)
"""
机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分支 [['七', '器', 1], ['遇', '域', 10], ['知', '支', 21]]
我是练习时长两念半的鸽仁练习生蔡徐坤 => 我是练习时长两年半的个人练习生蔡徐坤 [['念', '年', 7], ['鸽', '个', 10], ['仁', '人', 11]]
真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好地跳舞 [['的', '地', 12], ['无', '舞', 14]]
他法语说的很好,的语也不错 => 他法语说得很好,德语也不错 [['的', '得', 4], ['的', '德', 8]]
遇到一位很棒的奴生跟我疗天 => 遇到一位很棒的女生跟我聊天 [['奴', '女', 7], ['疗', '聊', 11]]
我们为这个目标努力不解 => 我们为这个目标努力不懈 [['解', '懈', 10]]
"""
```
📚 Documentation
I. Test
1.1 Test Data Source
The evaluation data is available at Macropodus/csc_eval_public. All training data comes from the public web or open-source datasets; there are roughly 10 million training samples, and the confusion dictionary used for data generation is fairly large.
1. gen_de3.json (5545): correction of '的地得', manually generated from high-quality data such as People's Daily, Learning Power (xuexiqiangguo), and chinese-poetry.
2. lemon_v2.tet.json (1053): data released with the ReLM paper; a multi-domain spelling-correction dataset covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW).
3. acc_rmrb.tet.json (4636): from NER-199801 (a high-quality People's Daily corpus).
4. acc_xxqg.tet.json (5000): from the high-quality corpus of the Learning Power website.
5. gen_passage.tet.json (10000): source sentences are good words and sentences generated by qwen; errors were generated with a confusion dictionary summarized from almost all open-source data.
6. textproof.tet.json (1447): NLP competition data from the Text Proofreading Competition.
7. gen_xxqg.tet.json (5000): source sentences from the high-quality Learning Power corpus; errors were generated with the same confusion dictionary.
8. faspell.dev.json (1000): video-subtitle OCR data from iQIYI's FASPell paper.
9. lomo_tet.json (5000): mainly near-homophone spelling errors, from Tencent's manually annotated CSCD-NS dataset.
10. mcsc_tet.5000.json (5000): medical spelling correction from real historical logs of the Tencent Yidian app; note that the paper states this dataset focuses only on correcting medical entities, not common characters.
11. ecspell.dev.json (1500): from the ECSpell paper, covering three domains (law/med/gov).
12. sighan2013.dev.json (1000): from the SIGHAN 2013 bake-off.
13. sighan2014.dev.json (1062): from the SIGHAN 2014 bake-off.
14. sighan2015.dev.json (1100): from the SIGHAN 2015 bake-off.
1.2 Test Data Preprocessing
The test data has been normalized with operations such as full-width to half-width conversion, simplified/traditional character conversion, and punctuation normalization (a sketch of the first operation follows).
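As an illustration of full-width to half-width conversion, a minimal version might look like the sketch below (this is not the project's actual preprocessing code):

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII-range characters to their half-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:     # full-width '!' .. '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


print(to_halfwidth("ＡＢＣ１２３"))  # -> "ABC123"
```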
1.3 Other Notes
1. Metrics prefixed with 'common' are very lenient, matching the evaluation of the open-source project pycorrector (a sentence-level sketch follows this list).
2. Metrics prefixed with 'strict' are very strict, matching the open-source project [wangwang110/CSC](https://github.com/wangwang110/CSC).
3. The macbert4mdcspell_v1 model is trained with the MDCSpell architecture plus BERT's MLM loss, but only the BERT-MLM part is used during inference.
4. The acc_rmrb/acc_xxqg datasets contain no errors and are used to evaluate the model's over-correction rate.
5. qwen25_1-5b_pycorrector refers to shibing624/chinese-text-correction-1.5b; its training data includes the validation and test sets of lemon_v2/mcsc_tet/ecspell, whereas the BERT-type models were not trained on any validation or test sets.
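As a rough illustration of what a lenient, sentence-level correction metric looks like, here is a generic formulation for intuition only (not the exact script used to produce the tables below):

```python
def sentence_correction_prf(sources, targets, predictions):
    """Lenient sentence-level correction precision/recall/F1 (illustrative)."""
    tp = fp = fn = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        if src != tgt:            # the sentence contains at least one error
            if pred == tgt:
                tp += 1           # fully corrected
            else:
                fn += 1           # missed or only partially corrected
        elif pred != src:
            fp += 1               # error-free sentence was changed (over-correction)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```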
II. Important Indicators
2.1 F1(common_cor_f1)
model/common_cor_f1 | avg | gen_de3 | lemon_v2 | gen_passage | text_proof | gen_xxqg | faspell | lomo_tet | mcsc_tet | ecspell | sighan2013 | sighan2014 | sighan2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
macbert4csc_pycorrector | 45.8 | 42.44 | 42.89 | 31.49 | 46.31 | 26.06 | 32.7 | 44.83 | 27.93 | 55.51 | 70.89 | 61.72 | 66.81 |
bert4csc_v1 | 62.28 | 93.73 | 61.99 | 44.79 | 68.0 | 35.03 | 48.28 | 61.8 | 64.41 | 79.11 | 77.66 | 51.01 | 61.54 |
macbert4csc_v1 | 68.55 | 96.67 | 65.63 | 48.4 | 75.65 | 38.43 | 51.76 | 70.11 | 80.63 | 85.55 | 81.38 | 57.63 | 70.7 |
macbert4csc_v2 | 68.6 | 96.74 | 66.02 | 48.26 | 75.78 | 38.84 | 51.91 | 70.17 | 80.71 | 85.61 | 80.97 | 58.22 | 69.95 |
macbert4mdcspell_v1 | 71.1 | 96.42 | 70.06 | 52.55 | 79.61 | 43.37 | 53.85 | 70.9 | 82.38 | 87.46 | 84.2 | 61.08 | 71.32 |
qwen25_1-5b_pycorrector | 45.11 | 27.29 | 89.48 | 14.61 | 83.9 | 13.84 | 18.2 | 36.71 | 96.29 | 88.2 | 36.41 | 15.64 | 20.73 |
2.2 acc(common_cor_acc)
model/common_cor_acc | avg | gen_de3 | lemon_v2 | gen_passage | text_proof | gen_xxqg | faspell | lomo_tet | mcsc_tet | ecspell | sighan2013 | sighan2014 | sighan2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
macbert4csc_pycorrector | 48.26 | 26.96 | 28.68 | 34.16 | 55.29 | 28.38 | 22.2 | 60.96 | 57.16 | 67.73 | 55.9 | 68.93 | 72.73 |
bert4csc_v1 | 60.76 | 88.21 | 45.96 | 43.13 | 68.97 | 35.0 | 34.0 | 65.86 | 73.26 | 81.8 | 64.5 | 61.11 | 67.27 |
macbert4csc_v1 | 65.34 | 93.56 | 49.76 | 44.98 | 74.64 | 36.1 | 37.0 | 73.0 | 83.6 | 86.87 | 69.2 | 62.62 | 72.73 |
macbert4csc_v2 | 65.22 | 93.69 | 50.14 | 44.92 | 74.64 | 36.26 | 37.0 | 72.72 | 83.66 | 86.93 | 68.5 | 62.43 | 71.73 |
macbert4mdcspell_v1 | 67.15 | 93.09 | 54.8 | 47.71 | 78.09 | 39.52 | 38.8 | 71.92 | 84.78 | 88.27 | 73.2 | 63.28 | 72.36 |
qwen25_1-5b_pycorrector | 46.09 | 15.82 | 81.29 | 22.96 | 82.17 | 19.04 | 12.8 | 50.2 | 96.4 | 89.13 | 22.8 | 27.87 | 32.55 |
2.3 acc (acc_true, thr=0.75)
model/acc | avg | acc_rmrb | acc_xxqg |
---|---|---|---|
macbert4csc_pycorrector | 99.24 | 99.22 | 99.26 |
bert4csc_v1 | 98.71 | 98.36 | 99.06 |
macbert4csc_v1 | 97.72 | 96.72 | 98.72 |
macbert4csc_v2 | 97.89 | 96.98 | 98.8 |
macbert4mdcspell_v1 | 97.75 | 96.51 | 98.98 |
qwen25_1-5b_pycorrector | 82.0 | 77.14 | 86.86 |
III. Conclusion
1. macbert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 are trained on data from multiple domains, which makes them relatively balanced; they are suitable as first-step pre-trained models that can be further fine-tuned on proprietary domain data.
2. Comparing macbert4csc_pycorrector/bert4csc_v1/macbert4csc_v2/macbert4mdcspell_v1 against Table 2.3 shows that more training data brings better correction performance, but also a slightly higher over-correction rate.
3. MFT (Mask-Correct) is still effective, but the gain is small once the data volume is sufficient; it may also be an important reason for the increased over-correction rate.
4. The training data also contains classical Chinese, so the trained model supports correcting classical Chinese as well.
5. The trained model recognizes and corrects high-frequency errors such as '地得的' with a high success rate.
🔧 Technical Details
The model uses the macbert4csc architecture with an additional error-detection branch. During training, MFT dynamically masks 20% of the non-error tokens and det_loss is weighted by 0.3; during inference, the detection layer is discarded and only the MLM correction head is used. The test data comes from multiple sources and has been normalized, and the evaluation metrics are reported in both a lenient ('common') and a strict ('strict') form.
📄 License
The model is licensed under the Apache-2.0 license.
📚 Papers
- 2024-Refining: Refining Corpora from a Model Calibration Perspective for Chinese
- 2024-ReLM: Chinese Spelling Correction as Rephrasing Language Model
- 2024-DISC: DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check
- 2023-Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check
- 2023-BERT-MFT: Rethinking Masked Language Modeling for Chinese Spelling Correction
- 2023-PTCSpell: PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction
- 2023-DR-CSC: [A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese](https://aclanthology.org/2023.findings-emnlp.771)
- 2023-DROM: Disentangled Phonetic Representation for Chinese Spelling Correction
- 2023-EGCM: An Error-Guided Correction Model for Chinese Spelling Error Correction
- 2023-IGPI: Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What’s Next?
- 2023-CL: Contextual Similarity is More Valuable than Character Similarity - An Empirical Study for Chinese Spell Checking
- 2022-CRASpell: [CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.237)
- 2022-MDCSpell: [MDCSpell: A Multi-task Detector-Corrector Framework for Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.98)
- 2022-SCOPE: Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity
- 2022-ECOPO: The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
- 2021-MLMPhonetics: [Correcting Chinese Spelling Errors with Phonetic Pre-training](https://aclanthology.org/2021.findings-acl.198)
- 2021-ChineseBERT: [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://aclanthology.org/2021.acl-long.161/)
- 2021-BERTCrsGad: [Global Attention Decoder for Chinese Spelling Error Correction](https://aclanthology.org/2021.findings-acl.122)
- 2021-ThinkTwice: [Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction](https://www.mdpi.com/2076-3417/11/13/5832)
- 2021-PHMOSpell: [PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check](https://aclanthology.org/2021.acl-long.464)
- 2021-SpellBERT: [SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check](https://aclanthology.org/2021.emnlp-main.287)
- 2021-TwoWays: [Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models](https://aclanthology.org/2021.acl-short.56)
- 2021-ReaLiSe: Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking
- 2021-DCSpell: DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction
- 2021-PLOME: [PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction](https://aclanthology.org/2021.acl-long.233)
- 2021-DCN: [Dynamic Connected Networks for Chinese Spelling Check](https://aclanthology.org/2021.findings-acl.216/)
- 2020-SoftMaskBERT: Spelling Error Correction with Soft-Masked BERT
- 2020-SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
- 2020-ChunkCSC: [Chunk-based Chinese Spelling Check with Global Optimization](https://aclanthology.org/2020.findings-emnlp.184)
- 2020-MacBERT: Revisiting Pre-Trained Models for Chinese Natural Language Processing
- 2019-FASPell: [FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm](https://aclanthology.org/D19-5522)
- 2018-Hybrid: [A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Checking](https://aclanthology.org/D18-1273)
- 2015-Sighan15: [Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check](https://aclanthology.org/W15-3106/)
- 2014-Sighan14: [Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check](https://aclanthology.org/W14-6820/)
- 2013-Sighan13: [Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013](https://aclanthology.org/W13-4406/)
📚 References
- [nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
- destwang/CTCResources
- wangwang110/CSC
- [chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
- [chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
- garychowcmu/daizhigev20
- yangjianxin1/Firefly
- Macropodus/xuexiqiangguo_428w
- Macropodus/csc_clean_wang271k
- Macropodus/csc_eval_public
- shibing624/pycorrector
- iioSnail/MDCSpell_pytorch
- gingasan/lemon
- [Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)
📚 Citation
To cite this work, you can refer to the GitHub project, for example with BibTeX:

```bibtex
@software{macro-correct,
  url = {https://github.com/yongzhuo/macro-correct},
  author = {Yongzhuo Mo},
  title = {macro-correct},
  year = {2025}
}
```

