🚀 MacBERT for Chinese Spelling Correction (macbert4csc) Model
A model for Chinese spelling correction.
Evaluated on the SIGHAN2015 test set, the macbert4csc-base-chinese model achieves the following results:
| | Correct-Precision | Correct-Recall | Correct-F1 |
| --- | --- | --- | --- |
| Character-level | 93.72 | 86.40 | 89.91 |
| Sentence-level | 82.64 | 73.66 | 77.89 |
Because the model is trained on the SIGHAN2015 training set (to reproduce the paper), it reaches state-of-the-art (SOTA) performance on the SIGHAN2015 test set.
The model architecture is adapted from Soft-Masked BERT.

🚀 Quick Start
✨ Features
- High accuracy in Chinese spelling correction.
- Based on the MacBERT architecture with its novel "MLM as correction" pre-training task.
📦 Installation
This model is part of the Chinese text correction project pycorrector. Install the relevant dependencies according to that project's requirements.
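For example, pycorrector is published on PyPI, so a minimal setup typically looks like this (pin versions per the project's requirements if needed):

```bash
pip install -U pycorrector
```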
💻 Usage Examples
Basic Usage
This model is open-sourced in the Chinese text correction project pycorrector, which supports the macbert4csc model. You can call it as follows:
```python
from pycorrector.macbert.macbert_corrector import MacBertCorrector

# Load the model from the Hugging Face Hub
m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
i = m.correct('今天新情很好')
print(i)
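```

For the input above, `correct` should return the corrected sentence 今天心情很好; the exact return format (text only, or text plus edit details) depends on the pycorrector version. Compare the sample output in the advanced example below.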
Advanced Usage
Alternatively, you can call the model directly through transformers:
```python
import operator

import torch
from transformers import BertTokenizer, BertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作,我也很高心。"]
with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, return_tensors='pt').to(device))


def get_errors(corrected_text, origin_text):
    """Align the decoded text with the original and collect (wrong, right, start, end) edits."""
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # The tokenizer drops these characters; re-insert the original one
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # The tokenizer lowercases English letters; restore the original case
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details


result = []
for ids, text in zip(outputs.logits, texts):
    # Greedy decoding: take the argmax token at every position
    _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
    corrected_text = _text[:len(text)]
    corrected_text, details = get_errors(corrected_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)
```
Output:

```
今天新情很好 => 今天心情很好 [('新', '心', 2, 3)]
你找到你最喜欢的工作,我也很高心。 => 你找到你最喜欢的工作,我也很高兴。 [('心', '兴', 15, 16)]
```
The model files are organized as follows:

```
macbert4csc-base-chinese
├── config.json
├── added_tokens.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```
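These are the standard files that `from_pretrained` expects. If you prefer to download them once and load from disk, a minimal sketch using huggingface_hub (assuming it is installed; `snapshot_download` fetches the whole repository) could look like this:

```python
from huggingface_hub import snapshot_download
from transformers import BertTokenizer, BertForMaskedLM

# Download all model files listed above into the local Hugging Face cache
local_dir = snapshot_download(repo_id="shibing624/macbert4csc-base-chinese")

# Load tokenizer and model from the local copy
tokenizer = BertTokenizer.from_pretrained(local_dir)
model = BertForMaskedLM.from_pretrained(local_dir)
```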
📚 Documentation
Training Datasets
SIGHAN + Wang271K Chinese Correction Dataset
The data format of the SIGHAN + Wang271K Chinese correction dataset is as follows:
```json
[
  {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
  }
]
```
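To make the format concrete, here is a minimal sketch that reads such a file and maps each index in `wrong_ids` to its wrong/correct character pair (`train.json` is a hypothetical path; `wrong_ids` are 0-based character offsets into `original_text`):

```python
import json

# Load SIGHAN + Wang271K style samples (train.json is a hypothetical path)
with open("train.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    original = sample["original_text"]
    correct = sample["correct_text"]
    # Each wrong_id is a 0-based character offset into original_text
    for idx in sample["wrong_ids"]:
        print(f'{sample["id"]}: position {idx}: {original[idx]} -> {correct[idx]}')
        # e.g. B2-4029-3: position 5: 嗓 -> 噪
```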
The trained macbert4csc model directory has the following layout:

```
macbert4csc
├── config.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```
If you need to train macbert4csc, please refer to https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert
🔧 Technical Details
MacBERT is an improved BERT with a novel MLM as correction pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.
Here is an example of our pre-training task.
| Task | Example |
| --- | --- |
| Original Sentence | we use a language model to predict the probability of the next word. |
| MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
| Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
| N-gram masking | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
| MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |
In addition to the new pre-training task, we also incorporate the following techniques (a toy sketch contrasting the masking strategies follows the list):
- Whole Word Masking (WWM)
- N-gram masking
- Sentence-Order Prediction (SOP)
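For illustration only, here is a minimal Python sketch (not the authors' implementation) contrasting plain [M] masking with the MLM-as-correction substitution; the similar-word table is a made-up stand-in for the similar-word lookup used in practice:

```python
# Toy similar-word table; a real system derives similar words from embeddings
SIMILAR = {"model": "system", "predict": "calculate", "probability": "possibility"}

def mlm_mask(tokens, positions):
    # Standard MLM: replace selected tokens with the artificial [M] token
    return [("[M]" if i in positions else t) for i, t in enumerate(tokens)]

def mac_mask(tokens, positions):
    # MLM as correction: replace selected tokens with similar words,
    # so the pre-training input never contains [M]
    return [(SIMILAR.get(t, t) if i in positions else t) for i, t in enumerate(tokens)]

tokens = "we use a language model to predict the probability of the next word .".split()
positions = {4, 6, 8}  # indices of "model", "predict", "probability"

print(" ".join(mlm_mask(tokens, positions)))
# we use a language [M] to [M] the [M] of the next word .
print(" ".join(mac_mask(tokens, positions)))
# we use a language system to calculate the possibility of the next word .
```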
Note that our MacBERT is a drop-in replacement for the original BERT, as there is no difference in the main neural architecture.
For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing
📄 License
This project is licensed under the Apache-2.0 license.
Citation
```bibtex
@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Text Error Correction Tool},
  year = {2021},
  url = {https://github.com/shibing624/pycorrector},
}
```