macbert4csc_v2: an open-source Chinese spelling correction model, a practical choice for multi-domain text correction

Macbert4csc V2

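Developed by Macropodus

macbert4csc_v2 is a model for Chinese spelling correction (CSC). It combines a masked-language-model backbone with an added error-detection branch and a dynamic-masking training strategy, scores well on multiple evaluation datasets, and suits text correction tasks across many domains.

Large language model

PyTorch

Language: Chinese · License: Apache-2.0 · Tags: #Chinese spelling correction #dynamic-mask training #multi-domain use

Downloads: 112

Release date: 1/16/2025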

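Model overview

The model is mainly used for Chinese spelling correction. It supports text correction across many domains, including Classical Chinese (文言文) and common high-frequency errors such as confusion of the particles 的/地/得.

Model features

Specific architecture design

An error-detection branch (a classification task) is added after BertForMaskedLM, and different strategies are used for training and inference.

Efficient training strategy

Training uses MFT (dynamic masking of 0.2 of the non-error tokens), and the weight of det_loss is 0.3.

Multi-domain applicability

The model is trained on data from many domains, so it works well as a starting pretrained model and can be fine-tuned further on domain-specific data.

Classical Chinese support

The training data include Classical Chinese text, so the model can also correct Classical Chinese.

High-frequency error handling

The model has high detection and correction rates for high-frequency errors such as 的/地/得.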

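Model capabilities

Spelling correction of Chinese text

Multi-domain text correction

Classical Chinese correction

Detection of high-frequency errors

Use cases

General text correction

Everyday text correction

Corrects spelling errors in everyday text.

Example: 「少先队员因该为老人让坐」 → 「少先队员应该为老人让坐」

Specialized-domain correction

Corrects spelling errors in domain-specific text.

Example: 「机七学习是人工智能领遇最能体现智能的一个分知」 → 「机器学习是人工智能领域最能体现智能的一个分支」

Handling of specific error types

的/地/得 correction

Specifically handles misuse of the particles 的/地/得, which is common in Chinese.

Example: 「希望你们好好的跳无」 → 「希望你们好好地跳舞」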

🚀 macbert4csc_v2

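macbert4csc_v2 is a model for Chinese spelling correction. It uses a specific architecture and training strategy and can be called in more than one way. It performs well on multiple evaluation datasets and suits text correction tasks in many domains.

🚀 Quick start

The model can be used both for Chinese spelling correction evaluation and for text correction. The project address is https://github.com/yongzhuo/macro-correct.

The weights are named macbert4csc_v2 and use the macbert4csc architecture (pycorrector version). Its distinguishing feature is a new branch for the error-detection task (a classification task, non-interactive) added after BertForMaskedLM. Training uses MFT (dynamically masking 0.2 of the non-error tokens) with a det_loss weight of 0.3. At inference time, the part after MacBERT (the det-layer) is discarded.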
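
The MFT masking step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `mft_mask`, the use of a fixed count (`round(0.2 * n)`) rather than per-token sampling, and the character-level tokenization are all hypothetical and are not taken from the macro-correct code. "Dynamic" is read here as "a fresh random mask is drawn every time a sample is seen".

```python
import random

MASK = "[MASK]"  # BERT-style mask token (assumed)


def mft_mask(source, target, mask_rate=0.2, rng=None):
    """Return a copy of `source` with about `mask_rate` of the non-error tokens masked.

    `source` and `target` are equal-length lists of characters. Positions where
    they differ are spelling errors and are never masked.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    rng = rng or random.Random()
    # Only positions that are already correct are candidates for masking.
    correct_positions = [i for i, (s, t) in enumerate(zip(source, target)) if s == t]
    n_mask = round(len(correct_positions) * mask_rate)
    masked = list(source)
    for i in rng.sample(correct_positions, n_mask):
        masked[i] = MASK
    return masked


source = list("少先队员因该为老人让坐")
target = list("少先队员应该为老人让坐")
masked = mft_mask(source, target, rng=random.Random(0))
print("".join(masked))
```

Calling `mft_mask` again with a different random state yields a different mask for the same sentence, which is the point of masking dynamically instead of once at preprocessing time.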

There are two ways to use it:

Call it with transformers.
Call it through the macro-correct project. See the usage examples below for details.

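✨ Main features

Specific architecture: a new branch for the error-detection task (a classification task, non-interactive) is added after BertForMaskedLM.
Training strategy: training uses MFT (dynamically masking 0.2 of the non-error tokens) with a det_loss weight of 0.3.
Inference optimization: at inference time, the part after MacBERT (the det-layer) is discarded.
Multi-domain applicability: the model is trained on data from many domains and is fairly balanced, so it works well as a first pretrained model. It can also be fine-tuned further on domain-specific data.
Classical Chinese support: the training data include Classical Chinese text, so the trained model can also correct Classical Chinese.
High-frequency error handling: the model has high detection and correction rates for high-frequency errors such as 的/地/得.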
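
To make the role of the detection branch and the 0.3 weight more concrete, here is a small pure-Python sketch. The source only states that det_loss has a weight of 0.3; the convex combination used in `combined_loss`, the binary cross-entropy form of the detection loss, and all numeric values below are assumptions for illustration, not the project's actual training code.

```python
import math


def detection_labels(source, target):
    """1 where the input character is wrong, 0 where it is already correct."""
    return [int(s != t) for s, t in zip(source, target)]


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over token positions."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def combined_loss(cor_loss, det_loss, det_weight=0.3):
    """Assumed mix: det_weight * det_loss + (1 - det_weight) * cor_loss."""
    return det_weight * det_loss + (1.0 - det_weight) * cor_loss


labels = detection_labels("希望你们好好的跳无", "希望你们好好地跳舞")
# Made-up detector probabilities, one per character.
det_probs = [0.01, 0.02, 0.01, 0.03, 0.02, 0.05, 0.66, 0.04, 0.99]
det_loss = binary_cross_entropy(det_probs, labels)
total = combined_loss(cor_loss=0.8, det_loss=det_loss)  # 0.8 is a placeholder MLM loss
print(labels, round(det_loss, 4), round(total, 4))
```

Because the detection branch only contributes to the training loss, dropping it at inference leaves a plain BertForMaskedLM, which is why the weights load directly with transformers.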

💻 Usage examples

Basic usage

Using macro-correct

import os
os.environ["MACRO_CORRECT_FLAG_CSC_TOKEN"] = "1"

from macro_correct import correct
### Default correction (list input)
text_list = ["真麻烦你了。希望你们好好的跳无",
             "少先队员因该为老人让坐",
             "机七学习是人工智能领遇最能体现智能的一个分知",
             "一只小鱼船浮在平净的河面上"
             ]
text_csc = correct(text_list)
print("Default correction (list input):")
for res_i in text_csc:
    print(res_i)
print("#" * 128)

"""
Default correction (list input):
{'index': 0, 'source': '真麻烦你了。希望你们好好的跳无', 'target': '真麻烦你了。希望你们好好地跳舞', 'errors': [['的', '地', 12, 0.6584], ['无', '舞', 14, 1.0]]}
{'index': 1, 'source': '少先队员因该为老人让坐', 'target': '少先队员应该为老人让坐', 'errors': [['因', '应', 4, 0.995]]}
{'index': 2, 'source': '机七学习是人工智能领遇最能体现智能的一个分知', 'target': '机器学习是人工智能领域最能体现智能的一个分支', 'errors': [['七', '器', 1, 0.9998], ['遇', '域', 10, 0.9999], ['知', '支', 21, 1.0]]}
{'index': 3, 'source': '一只小鱼船浮在平净的河面上', 'target': '一只小鱼船浮在平静的河面上', 'errors': [['净', '静', 8, 0.9961]]}
"""

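Each entry in `errors` above carries a fourth value that looks like a confidence score. One possible way to trade recall for fewer false corrections is to re-apply only the edits above a threshold. The sketch below works purely on the output dictionaries shown above; the helper `apply_errors` is hypothetical, reading the fourth value as a confidence is an assumption, and the default of 0.75 is borrowed from the `thr = 0.75` setting in section 2.3. It does not assume that `correct` itself accepts a threshold argument.

```python
def apply_errors(source, errors, threshold=0.75):
    """Rebuild the corrected text using only edits with confidence >= threshold."""
    chars = list(source)
    kept = []
    for wrong, right, pos, prob in errors:
        # Guard against stale positions: only edit if the character matches.
        if prob >= threshold and chars[pos] == wrong:
            chars[pos] = right
            kept.append([wrong, right, pos, prob])
    return "".join(chars), kept


result = {
    "index": 0,
    "source": "真麻烦你了。希望你们好好的跳无",
    "target": "真麻烦你了。希望你们好好地跳舞",
    "errors": [["的", "地", 12, 0.6584], ["无", "舞", 14, 1.0]],
}
text, kept = apply_errors(result["source"], result["errors"], threshold=0.75)
print(text)  # 真麻烦你了。希望你们好好的跳舞
print(kept)  # [['无', '舞', 14, 1.0]]
```

With the threshold lowered to 0.5, both edits are kept and the output equals the `target` field.
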
Using transformers

# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time    : 2021/2/29 21:41
# @author  : Mo
# @function: test loading a BERT-style model directly with transformers


import os
os.environ["USE_TORCH"] = "1"
from transformers import BertConfig, BertTokenizer, BertForMaskedLM
import torch

# pretrained_model_name_or_path = "shibing624/macbert4csc-base-chinese"
# pretrained_model_name_or_path = "Macropodus/macbert4mdcspell_v1"
# pretrained_model_name_or_path = "Macropodus/macbert4csc_v1"
pretrained_model_name_or_path = "Macropodus/macbert4csc_v2"
# pretrained_model_name_or_path = "Macropodus/bert4csc_v1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_len = 128

print("load model, please wait a few minutes!")
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
bert_config = BertConfig.from_pretrained(pretrained_model_name_or_path)
model = BertForMaskedLM.from_pretrained(pretrained_model_name_or_path)
model.to(device)
print("load model success!")

texts = [
    "机七学习是人工智能领遇最能体现智能的一个分知",
    "我是练习时长两念半的鸽仁练习生蔡徐坤",
    "真麻烦你了。希望你们好好的跳无",
    "他法语说的很好，的语也不错",
    "遇到一位很棒的奴生跟我疗天",
    "我们为这个目标努力不解",
]
# +2 leaves room for the [CLS] and [SEP] tokens
len_mid = min(max_len, max([len(t)+2 for t in texts]))

with torch.no_grad():
    # truncation=True is required for max_length to take effect
    outputs = model(**tokenizer(texts, padding=True, truncation=True,
                                max_length=len_mid,
                                return_tensors="pt").to(device))

def get_errors(source, target):
    """   Minimal way to collect errors: compare character by character   """
    len_min = min(len(source), len(target))
    errors = []
    for idx in range(len_min):
        if source[idx] != target[idx]:
            errors.append([source[idx], target[idx], idx])
    return errors

result = []
for probs, source in zip(outputs.logits, texts):
    ids = torch.argmax(probs, dim=-1)
    # drop [CLS] and the final position, then cut back to the source length
    tokens_space = tokenizer.decode(ids[1:-1], skip_special_tokens=False)
    text_new = tokens_space.replace(" ", "")
    target = text_new[:len(source)]
    errors = get_errors(source, target)
    print(source, " => ", target, errors)
    result.append([target, errors])
print(result)
"""
机七学习是人工智能领遇最能体现智能的一个分知  =>  机器学习是人工智能领域最能体现智能的一个分支 [['七', '器', 1], ['遇', '域', 10], ['知', '支', 21]]
我是练习时长两念半的鸽仁练习生蔡徐坤  =>  我是练习时长两年半的个人练习生蔡徐坤 [['念', '年', 7], ['鸽', '个', 10], ['仁', '人', 11]]
真麻烦你了。希望你们好好的跳无  =>  真麻烦你了。希望你们好好地跳舞 [['的', '地', 12], ['无', '舞', 14]]
他法语说的很好，的语也不错  =>  他法语说得很好，德语也不错 [['的', '得', 4], ['的', '德', 8]]
遇到一位很棒的奴生跟我疗天  =>  遇到一位很棒的女生跟我聊天 [['奴', '女', 7], ['疗', '聊', 11]]
我们为这个目标努力不解  =>  我们为这个目标努力不懈 [['解', '懈', 10]]
"""

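The script above caps the input at `max_len = 128` tokens, so anything beyond roughly 126 characters would be cut off. One way to handle longer documents is to split them into pieces before correction. The helper below, `split_for_csc`, is a hypothetical sketch and not part of macro-correct or transformers; it assumes one token per character, which holds for Chinese characters but not necessarily for Latin letters or digits.

```python
import re


def split_for_csc(text, max_chars=126):
    """Split text into pieces of at most max_chars characters.

    Splits after sentence-ending punctuation first, then hard-wraps any
    sentence that is still too long. Concatenating the pieces restores text.
    """
    sentences = [s for s in re.split(r"(?<=[。！？!?；;\n])", text) if s]
    pieces, current = [], ""
    for sent in sentences:
        while len(sent) > max_chars:  # hard-wrap oversized sentences
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sent[:max_chars])
            sent = sent[max_chars:]
        if len(current) + len(sent) > max_chars:
            pieces.append(current)
            current = ""
        current += sent
    if current:
        pieces.append(current)
    return pieces


long_text = "今天天气很好。" * 40  # 280 characters
pieces = split_for_csc(long_text)
print([len(p) for p in pieces])  # [126, 126, 28]
```

Error positions returned for each piece are relative to that piece, so the running length of the preceding pieces has to be added to map them back to the original text.
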
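📚 Documentation

I. Evaluation (Test)

1.1 Sources of the evaluation data

The evaluation data are available at Macropodus/csc_eval_public. All training data come from the public web or from open-source data; there are about 10 million training samples, and the confusion dictionary is large.

1. gen_de3.json (5545): correction of 的/地/得. Manually generated from high-quality data such as People's Daily, Xuexi Qiangguo, and chinese-poetry.
2. lemon_v2.tet.json (1053): data proposed in the ReLM paper. A multi-domain spelling correction dataset covering 7 domains: game (GAM), encyclopedia (ENC), contract (COT), medical care (MEC), car (CAR), novel (NOV), and news (NEW).
3. acc_rmrb.tet.json (4636): from NER-199801 (a high-quality People's Daily corpus).
4. acc_xxqg.tet.json (5000): from the high-quality corpus of the Xuexi Qiangguo website.
5. gen_passage.tet.json (10000): the source data are well-written phrases and passages generated by qwen; errors are generated from a confusion dictionary that collects almost all open-source data.
6. textproof.tet.json (1447): NLP competition data, TextProofreadingCompetition.
7. gen_xxqg.tet.json (5000): the source data are the high-quality corpus of the Xuexi Qiangguo website; errors are generated from a confusion dictionary that collects almost all open-source data.
8. faspell.dev.json (1000): a dataset obtained by OCR of video subtitles. From iQIYI's FASPell paper.
9. lomo_tet.json (5000): a Chinese spelling correction dataset consisting mainly of phonetically similar errors. From Tencent. The manually annotated dataset CSCD-NS.
10. mcsc_tet.5000.json (5000): medical spelling correction. From real historical logs of the Tencent Yidian (腾讯医典) app. According to the paper, this dataset focuses only on correcting medical entities and does not cover errors in common characters.
11. ecspell.dev.json (1500): from the ECSpell paper. Covers three domains (law/med/gov).
12. sighan2013.dev.json (1000): from SIGHAN 2013.
13. sighan2014.dev.json (1062): from SIGHAN 2014.
14. sighan2015.dev.json (1100): from SIGHAN 2015.

1.2 Preprocessing of the evaluation data

All evaluation data have been normalized: full-width characters converted to half-width, Traditional Chinese converted to Simplified Chinese, punctuation standardized, and so on.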
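
As an illustration of the first normalization step, the sketch below converts full-width letters and digits to their half-width forms. It is not the project's preprocessing code: the function name `to_half_width` and the choice to leave full-width punctuation such as `，` untouched by default are assumptions (the examples on this page keep `，` and `。`). Traditional-to-Simplified conversion needs a mapping table or an external tool and is not shown.

```python
def to_half_width(text, convert_punct=False):
    """Convert full-width ASCII variants (U+FF01..U+FF5E) to half-width.

    Letters and digits are always converted. Punctuation is converted only
    when convert_punct is True. The ideographic space U+3000 becomes " ".
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:  # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:  # full-width ASCII block
            half = chr(code - 0xFEE0)
            out.append(half if (convert_punct or half.isalnum()) else ch)
        else:
            out.append(ch)
    return "".join(out)


print(to_half_width("ＡＢＣ１２３，你好！"))                      # ABC123，你好！
print(to_half_width("ＡＢＣ１２３，你好！", convert_punct=True))  # ABC123,你好!
```

Applying the same normalization to both the input and the reference text before scoring keeps width differences from being counted as spelling errors.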

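1.3 Other notes

1. Metrics whose name contains common are very lenient and match the evaluation metrics of the open-source project pycorrector.
2. Metrics whose name contains strict are very strict and match the open-source project [wangwang110/CSC](https://github.com/wangwang110/CSC).
3. The macbert4mdcspell_v1 model is trained with the mdcspell architecture plus BERT's mlm-loss, but only bert-mlm is used at inference time.
4. The acc_rmrb/acc_xxqg datasets contain no errors and are used to evaluate the model's false-correction (over-correction) rate.
5. The qwen25_1-5b_pycorrector model is shibing624/chinese-text-correction-1.5b, whose training data include the validation and test sets of lemon_v2/mcsc_tet/ecspell. The training data of the other BERT-style models do not include any validation or test sets.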
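
For readers unfamiliar with correction metrics, the sketch below computes a generic sentence-level correction precision, recall, and F1. It is only a textbook-style illustration: the exact definitions behind the common and strict metrics live in pycorrector and wangwang110/CSC and may differ from this, and the function name and the sample predictions are made up.

```python
def sentence_level_correction_prf(sources, targets, predictions):
    """Sentence-level precision/recall/F1 for spelling correction.

    A prediction counts as a true positive only if the sentence contained an
    error and the model output equals the reference exactly.
    """
    tp = n_pred = n_gold = 0
    for src, tgt, pred in zip(sources, targets, predictions):
        changed = pred != src    # the model proposed an edit
        has_error = tgt != src   # the sentence really contains an error
        n_pred += changed
        n_gold += has_error
        tp += changed and has_error and pred == tgt
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


sources = ["少先队员因该为老人让坐", "我们为这个目标努力不解", "一只小鱼船浮在平静的河面上"]
targets = ["少先队员应该为老人让坐", "我们为这个目标努力不懈", "一只小鱼船浮在平静的河面上"]
# Made-up predictions: one correct fix, one missed error, one over-correction.
predictions = ["少先队员应该为老人让坐", "我们为这个目标努力不解", "一只小渔船浮在平静的河面上"]
print(sentence_level_correction_prf(sources, targets, predictions))  # (0.5, 0.5, 0.5)
```

The third sentence shows why error-free sets such as acc_rmrb and acc_xxqg matter: an edit to a clean sentence lowers precision even though no error was missed.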

II. Key metrics

2.1 F1(common_cor_f1)

model/common_cor_f1	avg	gen_de3	lemon_v2	gen_passage	text_proof	gen_xxqg	faspell	lomo_tet	mcsc_tet	ecspell	sighan2013	sighan2014	sighan2015
macbert4csc_pycorrector	45.8	42.44	42.89	31.49	46.31	26.06	32.7	44.83	27.93	55.51	70.89	61.72	66.81
bert4csc_v1	62.28	93.73	61.99	44.79	68.0	35.03	48.28	61.8	64.41	79.11	77.66	51.01	61.54
macbert4csc_v1	68.55	96.67	65.63	48.4	75.65	38.43	51.76	70.11	80.63	85.55	81.38	57.63	70.7
macbert4csc_v2	68.6	96.74	66.02	48.26	75.78	38.84	51.91	70.17	80.71	85.61	80.97	58.22	69.95
macbert4mdcspell_v1	71.1	96.42	70.06	52.55	79.61	43.37	53.85	70.9	82.38	87.46	84.2	61.08	71.32
qwen25_1-5b_pycorrector	45.11	27.29	89.48	14.61	83.9	13.84	18.2	36.71	96.29	88.2	36.41	15.64	20.73

2.2 acc(common_cor_acc)

model/common_cor_acc	avg	gen_de3	lemon_v2	gen_passage	text_proof	gen_xxqg	faspell	lomo_tet	mcsc_tet	ecspell	sighan2013	sighan2014	sighan2015
macbert4csc_pycorrector	48.26	26.96	28.68	34.16	55.29	28.38	22.2	60.96	57.16	67.73	55.9	68.93	72.73
bert4csc_v1	60.76	88.21	45.96	43.13	68.97	35.0	34.0	65.86	73.26	81.8	64.5	61.11	67.27
macbert4csc_v1	65.34	93.56	49.76	44.98	74.64	36.1	37.0	73.0	83.6	86.87	69.2	62.62	72.73
macbert4csc_v2	65.22	93.69	50.14	44.92	74.64	36.26	37.0	72.72	83.66	86.93	68.5	62.43	71.73
macbert4mdcspell_v1	67.15	93.09	54.8	47.71	78.09	39.52	38.8	71.92	84.78	88.27	73.2	63.28	72.36
qwen25_1-5b_pycorrector	46.09	15.82	81.29	22.96	82.17	19.04	12.8	50.2	96.4	89.13	22.8	27.87	32.55

2.3 acc(acc_true, thr = 0.75)

model/acc	avg	acc_rmrb	acc_xxqg
macbert4csc_pycorrector	99.24	99.22	99.26
bert4csc_v1	98.71	98.36	99.06
macbert4csc_v1	97.72	96.72	98.72
macbert4csc_v2	97.89	96.98	98.8
macbert4mdcspell_v1	97.75	96.51	98.98
qwen25_1-5b_pycorrector	82.0	77.14	86.86
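
The avg column in these tables appears to be the unweighted mean over the per-dataset columns, not a mean weighted by dataset size. The short check below recomputes it for table 2.3 from the values listed above; the tolerance of 0.01 allows for rounding in the reported numbers.

```python
# (avg, acc_rmrb, acc_xxqg) as reported in table 2.3
table_2_3 = {
    "macbert4csc_pycorrector": (99.24, 99.22, 99.26),
    "bert4csc_v1": (98.71, 98.36, 99.06),
    "macbert4csc_v1": (97.72, 96.72, 98.72),
    "macbert4csc_v2": (97.89, 96.98, 98.8),
    "macbert4mdcspell_v1": (97.75, 96.51, 98.98),
    "qwen25_1-5b_pycorrector": (82.0, 77.14, 86.86),
}


def macro_average(scores):
    """Unweighted mean: every dataset counts equally, whatever its size."""
    return sum(scores) / len(scores)


for model, (avg, rmrb, xxqg) in table_2_3.items():
    mean = macro_average([rmrb, xxqg])
    assert abs(mean - avg) < 0.01, model
    print(f"{model}: reported {avg}, recomputed {mean:.3f}")
```

Since acc_rmrb has 4636 sentences and acc_xxqg has 5000, a size-weighted mean would differ slightly, so the avg values should be read as per-dataset averages.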

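II. Conclusion

1. Models such as macbert4csc_v1, macbert4csc_v2, and macbert4mdcspell_v1 are trained on data from many domains and are fairly balanced, so they work well as a first pretrained model. They can also be fine-tuned further on domain-specific data.
2. Comparing macbert4csc_pycorrector/bertbase4csc_v1/macbert4csc_v2/macbert4mdcspell_v1, table 2.3 shows that more training data improves accuracy but also slightly raises the false-correction rate.
3. MFT (Mask-Correct) is still effective, but with a sufficiently large dataset the improvement is not significant, and it may be one of the main reasons the false-correction rate rises.
4. The training data also include Classical Chinese text, so the trained models can correct Classical Chinese as well.
5. The trained models have high detection and correction rates for high-frequency errors such as 的/地/得.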

IV. Papers

2024-Refining: Refining Corpora from a Model Calibration Perspective for Chinese
2024-ReLM: Chinese Spelling Correction as Rephrasing Language Model
2024-DICS: DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check
2023-Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check
2023-BERT-MFT: Rethinking Masked Language Modeling for Chinese Spelling Correction
2023-PTCSpell: PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction
2023-DR-CSC: [A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese](https://aclanthology.org/2023.findings-emnlp.771)
2023-DROM: Disentangled Phonetic Representation for Chinese Spelling Correction
2023-EGCM: An Error-Guided Correction Model for Chinese Spelling Error Correction
2023-IGPI: Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What’s Next?
2023-CL: Contextual Similarity is More Valuable than Character Similarity - An Empirical Study for Chinese Spell Checking
2022-CRASpell: [CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.237)
2022-MDCSpell: [MDCSpell: A Multi-task Detector-Corrector Framework for Chinese Spelling Correction](https://aclanthology.org/2022.findings-acl.98)
2022-SCOPE: Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity
2022-ECOPO: The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
2021-MLMPhonetics: [Correcting Chinese Spelling Errors with Phonetic Pre-training](https://aclanthology.org/2021.findings-acl.198)
2021-ChineseBERT: [ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://aclanthology.org/2021.acl-long.161/)
2021-BERTCrsGad: [Global Attention Decoder for Chinese Spelling Error Correction](https://aclanthology.org/2021.findings-acl.122)
2021-ThinkTwice: [Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction](https://www.mdpi.com/2076-3417/11/13/5832)
2021-PHMOSpell: [PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check](https://aclanthology.org/2021.acl-long.464)
2021-SpellBERT: [SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check](https://aclanthology.org/2021.emnlp-main.287)
2021-TwoWays: [Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models](https://aclanthology.org/2021.acl-short.56)
2021-ReaLiSe: Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking
2021-DCSpell: DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction
2021-PLOME: [PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction](https://aclanthology.org/2021.acl-long.233)
2021-DCN: [Dynamic Connected Networks for Chinese Spelling Check](https://aclanthology.org/2021.findings-acl.216/)
2020-SoftMaskBERT: Spelling Error Correction with Soft-Masked BERT
2020-SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
2020-ChunkCSC: [Chunk-based Chinese Spelling Check with Global Optimization](https://aclanthology.org/2020.findings-emnlp.184)
2020-MacBERT: Revisiting Pre-Trained Models for Chinese Natural Language Processing
2019-FASPell: [FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm](https://aclanthology.org/D19-5522)
2018-Hybrid: [A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Checking](https://aclanthology.org/D18-1273)
2015-Sighan15: [Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check](https://aclanthology.org/W15-3106/)
2014-Sighan14: [Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check](https://aclanthology.org/W14-6820/)
2013-Sighan13: [Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013](https://aclanthology.org/W13-4406/)

V. References

[nghuyong/Chinese-text-correction-papers](https://github.com/nghuyong/Chinese-text-correction-papers)
destwang/CTCResources
wangwang110/CSC
[chinese-poetry/chinese-poetry](https://github.com/chinese-poetry/chinese-poetry)
[chinese-poetry/huajianji](https://github.com/chinese-poetry/huajianji)
garychowcmu/daizhigev20
yangjianxin1/Firefly
Macropodus/xuexiqiangguo_428w
Macropodus/csc_clean_wang271k
Macropodus/csc_eval_public
shibing624/pycorrector
iioSnail/MDCSpell_pytorch
gingasan/lemon
[Claude-Liu/ReLM](https://github.com/Claude-Liu/ReLM)

VI. Citation

To cite this project, please refer to the current GitHub project. For example, in BibTeX format:

@software{macro-correct,
    url = {https://github.com/yongzhuo/macro-correct},
    author = {Yongzhuo Mo},
    title = {macro-correct},
    year = {2025}