
Chinese MacBERT Large

Developed by hfl
MacBERT is an improved Chinese BERT model that employs MLM as correction (Mac) as its pre-training task, alleviating the inconsistency between the pre-training and fine-tuning stages.
Downloads 13.05k
Release Time: 3/2/2022

Model Overview

MacBERT is an improved Chinese BERT model that enhances performance in Chinese natural language processing tasks by using similar words for masking instead of traditional [MASK] tokens, combined with techniques like whole-word masking, N-gram masking, and sentence order prediction.
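The core idea above can be illustrated with a toy sketch (this is not the released pre-training code; the hypothetical `SIMILAR_WORDS` table stands in for the word-embedding-based similar-word lookup used in pre-training):

```python
import random

# "MLM as correction" in miniature: instead of replacing a masked word
# with the artificial [MASK] token (which never appears at fine-tuning
# time), a similar word is substituted, so the model learns to *correct*
# plausible-looking errors back to the original word.
SIMILAR_WORDS = {
    "快乐": ["开心", "高兴"],
    "美丽": ["漂亮", "好看"],
}

def mac_mask(words, mask_prob=0.15, rng=random):
    """Return (corrupted, labels); masked positions carry the original word."""
    corrupted, labels = [], []
    for w in words:
        if w in SIMILAR_WORDS and rng.random() < mask_prob:
            corrupted.append(rng.choice(SIMILAR_WORDS[w]))  # similar word, not [MASK]
            labels.append(w)      # the model must recover the original word
        else:
            corrupted.append(w)
            labels.append(None)   # position not scored by the MLM loss
    return corrupted, labels
```

Because the corrupted input is still fluent text, the pre-training inputs look like the natural sentences the model later sees when fine-tuned.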

Model Features

Corrective MLM
Uses similar words for masking instead of [MASK] tokens, alleviating inconsistency between pre-training and fine-tuning stages
Whole-word masking
Employs whole-word masking to enhance the model's understanding of Chinese words
N-gram masking
Supports N-gram level masking to improve the model's comprehension of long texts
Sentence order prediction
Incorporates sentence order prediction tasks to enhance the model's understanding of text coherence
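Whole-word and N-gram masking combine naturally: masking operates on whole words rather than individual characters, and a chosen position is extended to a short span of consecutive words. A minimal sketch of the span-selection step (illustrative only; span lengths and the masking budget are simplified assumptions, not the paper's exact schedule):

```python
import random

def pick_ngram_spans(words, mask_ratio=0.15, max_n=4, rng=random):
    """Pick word indices to mask as n-gram spans over whole words.

    Each span covers 1..max_n consecutive *words* (a multi-character
    Chinese word is treated as one unit). The loop may slightly
    overshoot the budget, which is fine for a sketch.
    """
    budget = max(1, int(len(words) * mask_ratio))
    masked = set()
    while len(masked) < budget:
        n = rng.randint(1, max_n)                # span length in whole words
        start = rng.randrange(len(words))
        masked.update(range(start, min(start + n, len(words))))
    return sorted(masked)
```

Every index in a selected span is then corrupted together, so the model never sees half of a word masked.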

Model Capabilities

Chinese text understanding
Text classification
Named entity recognition
Question answering systems
Text similarity calculation
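For the text similarity capability, a common pattern is to pool the model's hidden states into one vector per sentence and compare vectors with cosine similarity. How the vectors are obtained is assumed here (e.g. mean pooling over MacBERT's last hidden states) and not shown; the comparison itself is just:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Values near 1.0 indicate semantically close sentences, values near 0.0 unrelated ones.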

Use Cases

Natural Language Processing
Chinese text classification
Used for tasks like sentiment analysis and topic classification in Chinese text
Named entity recognition
Identifies entities such as person names, locations, and organizations in Chinese text
Question answering systems
Builds Chinese question answering systems to respond to text-based questions