macbert4csc-base-chinese: an open-source Chinese spelling correction model that reaches state-of-the-art results on the SIGHAN2015 test set

Macbert4csc Base Chinese

Developed by shibing624

A MacBERT-based Chinese spelling correction model that reaches state-of-the-art results on the SIGHAN2015 test set

Large language model

Transformers

Language: Chinese | License: Apache-2.0 | Tags: Chinese spelling correction, SIGHAN SOTA model, MacBERT architecture

Downloads: 9,623

Released: 3/2/2022

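Model overview

The model detects and corrects spelling errors in Chinese text. It uses an improved MacBERT architecture and suits a wide range of Chinese proofreading scenarios.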

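Model features

State-of-the-art performance

Reaches a character-level F1 of 89.91 and a sentence-level F1 of 77.89 on the SIGHAN2015 test set, reported as the current state of the art

Improved architecture

A MacBERT architecture adapted from softmaskedbert; the MLM-as-correction pre-training task improves model performance

Comprehensive training data

Trained on the SIGHAN+Wang271K Chinese correction dataset, which contains about 270,000 high-quality correction samples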

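Model capabilities

Chinese spelling error detection

Automatic correction of Chinese text

Identification and correction of wrongly written characters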

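Use cases

Text proofreading

Everyday text correction

Automatically corrects spelling errors in chats, emails, and other everyday text

Example: '今天新情很好' → '今天心情很好'

Formal document proofreading

Helps check the textual accuracy of reports, papers, and other formal documents

Educational support

Chinese learning support

Helps learners of Chinese identify and correct errors in their writing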

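🚀 MacBERT Chinese spelling correction (macbert4csc) model

macbert4csc is a model for Chinese spelling correction. It performs well in Chinese text correction scenarios and improves the accuracy and quality of text.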

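Evaluation results of macbert4csc-base-chinese on the SIGHAN2015 test data:

	Correction precision	Correction recall	Correction F1
Character level	93.72	86.40	89.91
Sentence level	82.64	73.66	77.89

Because the training data includes the SIGHAN2015 training set (to reproduce the paper), the model reaches SOTA level on the SIGHAN2015 test set.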
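The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the helper name `f1_score` is illustrative and not part of pycorrector.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the evaluation table.
reported = {
    "character level": (93.72, 86.40),
    "sentence level": (82.64, 73.66),
}

for level, (p, r) in reported.items():
    print(f"{level}: F1 = {f1_score(p, r):.2f}")
```

Under that assumption, both recomputed values agree with the reported F1 figures to two decimal places.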

The model structure is adapted from and improves on softmaskedbert:

[Figure: arch, the model architecture diagram]

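🚀 Quick start

This project is open-sourced as part of the Chinese text correction project pycorrector, which supports the macbert4csc model. It can be called as follows.

💻 Usage examples

Basic usage

Call the model through the pycorrector library: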

from pycorrector.macbert.macbert_corrector import MacBertCorrector

m = MacBertCorrector("shibing624/macbert4csc-base-chinese")

i = m.correct('今天新情很好')
print(i)

Advanced usage

Call the model through the transformers library:

import operator
import torch
from transformers import BertTokenizer, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜歡的工作，我也很高心。"]
with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, return_tensors='pt').to(device))

def get_errors(corrected_text, origin_text):
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # add unk word
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # pass english upper char
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details

result = []
for ids, text in zip(outputs.logits, texts):
    _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
    corrected_text = _text[:len(text)]
    corrected_text, details = get_errors(corrected_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)

Output:

今天新情很好  =>  今天心情很好 [('新', '心', 2, 3)]
你找到你最喜歡的工作，我也很高心。  =>  你找到你最喜歡的工作，我也很高興。 [('心', '興', 15, 16)]
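Each detail tuple in the output has the form (wrong character, corrected character, start index, end index). As a minimal sketch of how downstream code might consume these tuples, the helpers below rebuild the corrected sentence and mark the changes for review. The names `apply_edits` and `mark_edits` are hypothetical, and the code assumes every edit is a same-length substitution, which is what `get_errors` produces.

```python
from typing import List, Tuple

# (wrong_char, corrected_char, start, end), as produced by get_errors
Edit = Tuple[str, str, int, int]


def apply_edits(text: str, edits: List[Edit]) -> str:
    """Rebuild the corrected sentence from the original text and edit tuples."""
    chars = list(text)
    for wrong, right, start, end in edits:
        # Guard against edits that do not match the text they claim to fix.
        if text[start:end] != wrong:
            raise ValueError(
                f"expected {wrong!r} at [{start}:{end}], found {text[start:end]!r}"
            )
        # Same-length substitution, so later indices stay valid.
        chars[start:end] = right
    return "".join(chars)


def mark_edits(text: str, edits: List[Edit]) -> str:
    """Wrap each flagged span as [wrong->right] for human review."""
    out, cursor = [], 0
    for wrong, right, start, end in sorted(edits, key=lambda e: e[2]):
        out.append(text[cursor:start])
        out.append(f"[{wrong}->{right}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


details = [("新", "心", 2, 3)]
print(apply_edits("今天新情很好", details))
print(mark_edits("今天新情很好", details))
```

Keeping the original text and the edit list separate, rather than storing only the corrected string, lets a reviewer accept or reject individual corrections.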

Model files

macbert4csc-base-chinese
    ├── config.json
    ├── added_tokens.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

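📚 Detailed documentation

Training dataset

SIGHAN+Wang271K Chinese correction dataset

Dataset	Corpus	Download link	Archive size
`SIGHAN+Wang271K Chinese correction dataset`	SIGHAN+Wang271K (270,000 samples)	Baidu Netdisk (password 01b9)	106M
`Original SIGHAN dataset`	SIGHAN13 14 15	Official csc.html	339K
`Original Wang271K dataset`	Wang271K	Automatic-Corpus-Generation, provided by dimmywang	93M

The SIGHAN+Wang271K Chinese correction dataset uses the following data format: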

[
    {
        "id": "B2-4029-3",
        "original_text": "晚間會聽到嗓音，白天的時候大家都不會太在意，但是在睡覺的時候這嗓音成為大家的惡夢。",
        "wrong_ids": [
            5,
            31
        ],
        "correct_text": "晚間會聽到噪音，白天的時候大家都不會太在意，但是在睡覺的時候這噪音成為大家的惡夢。"
    }
]
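In this record, `wrong_ids` appears to hold the zero-based character positions at which `original_text` and `correct_text` differ. The sketch below checks that reading against the sample record. `diff_positions` is a hypothetical helper, and the equal-length assumption reflects spelling correction being a character-for-character substitution task.

```python
import json
from typing import List

record_json = """
[
    {
        "id": "B2-4029-3",
        "original_text": "晚間會聽到嗓音，白天的時候大家都不會太在意，但是在睡覺的時候這嗓音成為大家的惡夢。",
        "wrong_ids": [5, 31],
        "correct_text": "晚間會聽到噪音，白天的時候大家都不會太在意，但是在睡覺的時候這噪音成為大家的惡夢。"
    }
]
"""


def diff_positions(original: str, correct: str) -> List[int]:
    """Return zero-based indices where the two equal-length strings differ."""
    if len(original) != len(correct):
        raise ValueError("original and corrected text must have equal length")
    return [i for i, (a, b) in enumerate(zip(original, correct)) if a != b]


records = json.loads(record_json)
for rec in records:
    derived = diff_positions(rec["original_text"], rec["correct_text"])
    # Report whether the derived positions match the annotated wrong_ids.
    print(rec["id"], derived, derived == rec["wrong_ids"])
```

A check like this can be run over a whole dataset file before training to catch misaligned or mislabelled records.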

Model file structure:

macbert4csc
    ├── config.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

To train macbert4csc yourself, see https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert

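About MacBERT

MacBERT is an improved BERT model. It introduces a novel pre-training task, MLM as correction, which reduces the discrepancy between pre-training and fine-tuning.

An example of the pre-training task: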

Task	Example
Original sentence	we use a language model to predict the probability of the next word.
MLM	we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word .
Whole word masking	we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word .
N-gram masking	we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word .
MLM as correction	we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word .
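The last row of the table replaces selected words with similar words instead of a [M] token, so the model learns to restore the original word from a plausible but wrong one. The toy sketch below illustrates that idea only. It is not the MacBERT implementation: the function name `mac_mask`, the small `similar` table, and the `vocab` list are hypothetical; the default ratios (select 15% of tokens, then 80% similar word, 10% random word, 10% unchanged) are assumptions; and it works per token rather than per whole word or N-gram for brevity.

```python
import random
from typing import Dict, List, Optional, Tuple


def mac_mask(
    tokens: List[str],
    similar: Dict[str, List[str]],
    vocab: List[str],
    rng: random.Random,
    select_prob: float = 0.15,   # assumed share of tokens selected
    similar_prob: float = 0.8,   # assumed share replaced by a similar word
    random_prob: float = 0.1,    # assumed share replaced by a random word
) -> Tuple[List[str], List[Optional[str]]]:
    """Corrupt tokens for an MLM-as-correction style objective.

    Returns the corrupted tokens and per-position labels: the original token
    at selected positions, None at positions excluded from the loss.
    """
    corrupted: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)      # not selected: keep, no label
            labels.append(None)
            continue
        labels.append(tok)             # model must recover the original
        draw = rng.random()
        if draw < similar_prob and similar.get(tok):
            corrupted.append(rng.choice(similar[tok]))
        elif draw < similar_prob + random_prob:
            # Random word; also the fallback when no similar word is known.
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(tok)      # selected but left unchanged
    return corrupted, labels


sentence = "we use a language model to predict the probability of the next word".split()
similar_words = {"language": ["text"], "model": ["system"], "predict": ["calculate"]}
corrupted, labels = mac_mask(sentence, similar_words, sentence, random.Random(0))
print(corrupted)
print(labels)
```

Because the corrupted input contains real words rather than a [M] symbol, the pre-training input looks like the fine-tuning input, which is the discrepancy the task is meant to reduce.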

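In addition to the new pre-training task, the model uses the following techniques:

Whole Word Masking (WWM)
N-gram masking
Sentence-Order Prediction (SOP)

Note that MacBERT can be used as a drop-in replacement for the original BERT, because the main neural architecture is the same.

For more technical details, see the paper: Revisiting Pre-trained Models for Chinese Natural Language Processing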

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Text Error Correction Tool},
  year = {2021},
  url = {https://github.com/shibing624/pycorrector},
}