# CNMBert
A model for translating Chinese Pinyin abbreviations into full Chinese characters.
## Quick Start
CNMBert is trained on top of Chinese-BERT-wwm: its pre-training task is modified to adapt it to the Pinyin abbreviation translation task. It achieves state-of-the-art performance compared with fine-tuned GPT models and GPT-4o.

GitHub
## Features
### What are Pinyin Abbreviations?
Pinyin abbreviations are forms like:
"bhys" -> "äžå¥œææ" (Sorry)
"ys" -> "åç¥" (Genshin Impact)
These abbreviations use the initial letters of Chinese Pinyin to replace Chinese characters. If you are interested in Pinyin abbreviations, you can refer to this: 倧家䞺ä»ä¹äŒè®šå猩åïŒ - è¿æ¹éæšçåç - ç¥ä¹
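As a concrete illustration of the mapping, an abbreviation is simply the first letter of each character's Pinyin. The sketch below uses the third-party pypinyin package (not a CNMBert dependency) to derive it:

```python
# Minimal sketch (not part of CNMBert): derive the Pinyin-initial abbreviation
# of a Chinese phrase with the third-party pypinyin package.
from pypinyin import Style, lazy_pinyin

def to_abbreviation(phrase: str) -> str:
    # Style.FIRST_LETTER yields the first letter of each character's Pinyin.
    return "".join(lazy_pinyin(phrase, style=Style.FIRST_LETTER))

print(to_abbreviation("不好意思"))  # -> "bhys"
print(to_abbreviation("原神"))      # -> "ys"
```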
### CNMBert Model Comparison

| Property | CNMBert-Default | CNMBert-MoE |
| --- | --- | --- |
| Model Weights | CNMBert-Default | CNMBert-MoE |
| Memory Usage (FP16) | 0.4 GB | 0.8 GB |
| Model Size | 131M | 329M |
| QPS | 12.56 | 3.20 |
| MRR | 59.70 | 61.53 |
| Acc | 49.74 | 51.86 |
- All models are trained on the same corpus of 2 million entries drawn from Wikipedia and Zhihu.
- QPS stands for queries per second (throughput is currently low because the `predict` function has not been rewritten in C).
- MRR is the mean reciprocal rank.
- Acc is the accuracy (see the sketch after this list).
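For readers unfamiliar with these metrics, here is a small sketch of how MRR and accuracy (read here as top-1 accuracy, which is an interpretation) can be computed over ranked candidate lists. It is illustrative only, not the evaluation script used to produce the table above:

```python
# Illustrative sketch of MRR and top-1 accuracy over ranked candidate lists.
def mrr(ranked_candidates, gold):
    # Average reciprocal rank of the gold answer; 0 if it is missing.
    total = 0.0
    for cands, g in zip(ranked_candidates, gold):
        total += next((1.0 / (i + 1) for i, c in enumerate(cands) if c == g), 0.0)
    return total / len(gold)

def accuracy(ranked_candidates, gold):
    # Top-1 accuracy: the highest-ranked candidate must match the gold answer.
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if cands and cands[0] == g)
    return hits / len(gold)

preds = [["块钱", "块前"], ["吧", "病"]]
print(mrr(preds, ["块钱", "病"]))       # (1/1 + 1/2) / 2 = 0.75
print(accuracy(preds, ["块钱", "病"]))  # 1 / 2 = 0.5
```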
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, BertConfig
from CustomBertModel import predict
from MoELayer import BertWwmMoE
```
Load the model:
```python
tokenizer = AutoTokenizer.from_pretrained("Midsummra/CNMBert-MoE")
config = BertConfig.from_pretrained('Midsummra/CNMBert-MoE')
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to('cuda')
```
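The memory figures in the comparison table are for FP16, while the snippet above loads the weights at their default precision. If you want to match the FP16 numbers, one option (an assumption on my part, not something the original instructions show) is PyTorch's standard `half()` conversion:

```python
# Optional: convert the weights to FP16 before moving them to the GPU.
# This mirrors the FP16 memory figures above; it is an assumption, not part
# of the original loading instructions.
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).half().to('cuda')
```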
Predict words:
```python
print(predict("我有两千kq", "kq", model, tokenizer)[:5])
print(predict("快去给魔理沙看b吧", "b", model, tokenizer)[:5])
```
The output will be something like:
```
['块钱', 1.2056937473156175], ['块前', 0.05837443749364857], ['开千', 0.0483869208528063], ['可千', 0.03996622172280445], ['口气', 0.037183335575008414]
['病', 1.6893256306648254], ['吧', 0.1642467901110649], ['呗', 0.026976384222507477], ['包', 0.021441461518406868], ['报', 0.01396679226309061]
```
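Each candidate is a `[word, score]` pair and the list is sorted by score, so taking the single best expansion is just a matter of reading the first entry. A small helper (the function name here is illustrative, not part of the library):

```python
# Illustrative helper: return only the highest-scoring expansion, or None
# if predict returns no candidates.
def best_expansion(sentence, abbreviation, model, tokenizer):
    candidates = predict(sentence, abbreviation, model, tokenizer)
    return candidates[0][0] if candidates else None

print(best_expansion("我有两千kq", "kq", model, tokenizer))  # e.g. '块钱'
```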
### Advanced Usage

The full signatures of `predict` and `backtrack_predict` are:
```python
def predict(sentence: str,
            predict_word: str,
            model,
            tokenizer,
            top_k=8,
            beam_size=16,
            threshold=0.005,
            fast_mode=True,
            strict_mode=True):
    ...
```

```python
def backtrack_predict(sentence: str,
                      predict_word: str,
                      model,
                      tokenizer,
                      top_k=10,
                      fast_mode=True,
                      strict_mode=True):
    ...
```
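As an example of how these parameters can be used, the call below widens the search and relaxes the filters. The values are arbitrary and only meant to illustrate the keyword arguments exposed by `predict`:

```python
# Wider, more permissive search (the values here are arbitrary examples).
results = predict("我有两千kq", "kq", model, tokenizer,
                  top_k=16,         # request more candidates
                  beam_size=32,     # widen the beam search
                  threshold=0.001,  # let lower-scoring candidates through
                  fast_mode=True,
                  strict_mode=False)
print(results[:10])
```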
## Important Note
Due to the autoencoding nature of BERT, different prediction orders for the MASK tokens lead to different results. If `fast_mode` is enabled, the input is predicted both forward and backward, which improves accuracy by about 2% but also brings greater performance overhead.
## Usage Tip
`strict_mode` will check the input to determine whether it is a real Chinese word.
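If you want to measure the speed/accuracy trade-off of `fast_mode` on your own hardware, a quick timing sketch (not part of the repository) is enough:

```python
# Rough timing of a single query with fast_mode on and off (illustrative only).
import time

for fast in (True, False):
    start = time.perf_counter()
    predict("我有两千kq", "kq", model, tokenizer, fast_mode=fast)
    print(f"fast_mode={fast}: {time.perf_counter() - start:.3f}s")
```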
## Documentation
### How to Fine-Tune the Model
Please refer to TrainExample.ipynb. As for the dataset format, just make sure that the first column of the CSV file is the corpus to be trained on.
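Concretely, the training file can be a one-column CSV, built here with pandas purely as an illustration (the column name is arbitrary; only the requirement that the corpus sits in the first column comes from the original instructions):

```python
# Build a minimal fine-tuning CSV whose first column holds the raw corpus,
# one sentence per row. The column name "text" is an arbitrary choice.
import pandas as pd

corpus = [
    "我有两千块钱",
    "快去给魔理沙看病吧",
]
pd.DataFrame({"text": corpus}).to_csv("train.csv", index=False)
```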
### Q&A
Q: The accuracy of this model seems a bit low.

A: You can try setting `fast_mode` and `strict_mode` to `False`. The model is pre-trained on a relatively small dataset (2 million entries), so limited generalization ability is to be expected. You can fine-tune it on a larger or more domain-specific dataset; the fine-tuning procedure is not very different from that of Chinese-BERT-wwm. You just need to replace `DataCollactor` in `CustomBertModel.py` with `DataCollatorForMultiMask`.
## Citation
If you are interested in the specific implementation of CNMBert, you can refer to:
```bibtex
@misc{feng2024cnmbertmodelhanyupinyin,
      title={CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task},
      author={Zishuo Feng and Feng Cao},
      year={2024},
      eprint={2411.11770},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.11770},
}
```
## License
This project is licensed under the AGPL-3.0 license.