# CNMBert
A model for translating Chinese Pinyin abbreviations into full Chinese characters.
## Quick Start
CNMBert is trained on top of Chinese-BERT-wwm: its pre-training task is modified to adapt it to the Pinyin abbreviation translation task. It achieves state-of-the-art performance compared with fine-tuned GPT models and GPT-4o.

GitHub
## Features
### What are Pinyin Abbreviations?
Pinyin abbreviations are forms like:
"bhys" -> "äžå¥œææ" (Sorry)
"ys" -> "åç¥" (Genshin Impact)
These abbreviations use the initial letters of Chinese Pinyin to replace Chinese characters. If you are interested in Pinyin abbreviations, you can refer to this: 倧家䞺ä»ä¹äŒè®šå猩åïŒ - è¿æ¹éæšçåç - ç¥ä¹
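As a concrete illustration of the mapping, an abbreviation is simply the first letter of each character's Pinyin. The sketch below uses the third-party pypinyin package (not a CNMBert dependency) to derive it:

```python
# Minimal sketch (not part of CNMBert): derive the Pinyin-initial abbreviation
# of a Chinese phrase with the third-party pypinyin package.
from pypinyin import Style, lazy_pinyin

def to_abbreviation(phrase: str) -> str:
    # Style.FIRST_LETTER yields the first letter of each character's Pinyin.
    return "".join(lazy_pinyin(phrase, style=Style.FIRST_LETTER))

print(to_abbreviation("不好意思"))  # -> "bhys"
print(to_abbreviation("原神"))      # -> "ys"
```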
### CNMBert Model Comparison

| Property | CNMBert-Default | CNMBert-MoE |
| --- | --- | --- |
| Model Weights | CNMBert-Default | CNMBert-MoE |
| Memory Usage (FP16) | 0.4 GB | 0.8 GB |
| Model Size | 131M | 329M |
| QPS | 12.56 | 3.20 |
| MRR | 59.70 | 61.53 |
| Acc | 49.74 | 51.86 |
- All models are trained on the same corpus of 2 million entries drawn from Wikipedia and Zhihu.
- QPS stands for queries per second (throughput is currently low because the `predict` function has not been rewritten in C).
- MRR is the mean reciprocal rank.
- Acc is the accuracy (see the sketch after this list).
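For readers unfamiliar with these metrics, here is a small sketch of how MRR and accuracy (read here as top-1 accuracy, which is an interpretation) can be computed over ranked candidate lists. It is illustrative only, not the evaluation script used to produce the table above:

```python
# Illustrative sketch of MRR and top-1 accuracy over ranked candidate lists.
def mrr(ranked_candidates, gold):
    # Average reciprocal rank of the gold answer; 0 if it is missing.
    total = 0.0
    for cands, g in zip(ranked_candidates, gold):
        total += next((1.0 / (i + 1) for i, c in enumerate(cands) if c == g), 0.0)
    return total / len(gold)

def accuracy(ranked_candidates, gold):
    # Top-1 accuracy: the highest-ranked candidate must match the gold answer.
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if cands and cands[0] == g)
    return hits / len(gold)

preds = [["块钱", "块前"], ["吧", "病"]]
print(mrr(preds, ["块钱", "病"]))       # (1/1 + 1/2) / 2 = 0.75
print(accuracy(preds, ["块钱", "病"]))  # 1 / 2 = 0.5
```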
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, BertConfig
from CustomBertModel import predict
from MoELayer import BertWwmMoE
```
Load the model:
```python
tokenizer = AutoTokenizer.from_pretrained("Midsummra/CNMBert-MoE")
config = BertConfig.from_pretrained('Midsummra/CNMBert-MoE')
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to('cuda')
```
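The memory figures in the comparison table are for FP16, while the snippet above loads the weights at their default precision. If you want to match the FP16 numbers, one option (an assumption on my part, not something the original instructions show) is PyTorch's standard `half()` conversion:

```python
# Optional: convert the weights to FP16 before moving them to the GPU.
# This mirrors the FP16 memory figures above; it is an assumption, not part
# of the original loading instructions.
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).half().to('cuda')
```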
Predict words:
```python
print(predict("我有两千kq", "kq", model, tokenizer)[:5])
print(predict("快去给魔理沙看b吧", "b", model, tokenizer)[:5])
```
The output will be something like:
```
['块钱', 1.2056937473156175], ['块前', 0.05837443749364857], ['开千', 0.0483869208528063], ['可千', 0.03996622172280445], ['口气', 0.037183335575008414]
['病', 1.6893256306648254], ['吧', 0.1642467901110649], ['呗', 0.026976384222507477], ['包', 0.021441461518406868], ['报', 0.01396679226309061]
```
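Each candidate is a `[word, score]` pair and the list is sorted by score, so taking the single best expansion is just a matter of reading the first entry. A small helper (the function name here is illustrative, not part of the library):

```python
# Illustrative helper: return only the highest-scoring expansion, or None
# if predict returns no candidates.
def best_expansion(sentence, abbreviation, model, tokenizer):
    candidates = predict(sentence, abbreviation, model, tokenizer)
    return candidates[0][0] if candidates else None

print(best_expansion("我有两千kq", "kq", model, tokenizer))  # e.g. '块钱'
```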
### Advanced Usage

The full signatures of `predict` and `backtrack_predict` are:
```python
def predict(sentence: str,
            predict_word: str,
            model,
            tokenizer,
            top_k=8,
            beam_size=16,
            threshold=0.005,
            fast_mode=True,
            strict_mode=True):
    ...
```

```python
def backtrack_predict(sentence: str,
                      predict_word: str,
                      model,
                      tokenizer,
                      top_k=10,
                      fast_mode=True,
                      strict_mode=True):
    ...
```
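As an example of how these parameters can be used, the call below widens the search and relaxes the filters. The values are arbitrary and only meant to illustrate the keyword arguments exposed by `predict`:

```python
# Wider, more permissive search (the values here are arbitrary examples).
results = predict("我有两千kq", "kq", model, tokenizer,
                  top_k=16,         # request more candidates
                  beam_size=32,     # widen the beam search
                  threshold=0.001,  # let lower-scoring candidates through
                  fast_mode=True,
                  strict_mode=False)
print(results[:10])
```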
## Important Note
Due to the autoencoding nature of BERT, different prediction orders for the MASK tokens lead to different results. If `fast_mode` is enabled, the input is predicted both forward and backward, which improves accuracy by about 2% but also brings greater performance overhead.
## Usage Tip
`strict_mode` will check the input to determine whether it is a real Chinese word.
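If you want to measure the speed/accuracy trade-off of `fast_mode` on your own hardware, a quick timing sketch (not part of the repository) is enough:

```python
# Rough timing of a single query with fast_mode on and off (illustrative only).
import time

for fast in (True, False):
    start = time.perf_counter()
    predict("我有两千kq", "kq", model, tokenizer, fast_mode=fast)
    print(f"fast_mode={fast}: {time.perf_counter() - start:.3f}s")
```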
## Documentation
### How to Fine-Tune the Model
Please refer to TrainExample.ipynb. As for the dataset format, just make sure that the first column of the CSV file is the corpus to be trained on.
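Concretely, the training file can be a one-column CSV, built here with pandas purely as an illustration (the column name is arbitrary; only the requirement that the corpus sits in the first column comes from the original instructions):

```python
# Build a minimal fine-tuning CSV whose first column holds the raw corpus,
# one sentence per row. The column name "text" is an arbitrary choice.
import pandas as pd

corpus = [
    "我有两千块钱",
    "快去给魔理沙看病吧",
]
pd.DataFrame({"text": corpus}).to_csv("train.csv", index=False)
```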
### Q&A
Q: The accuracy of this model seems a bit low.

A: You can try setting `fast_mode` and `strict_mode` to `False`. The model is pre-trained on a relatively small dataset (2 million entries), so limited generalization ability is to be expected. You can fine-tune it on a larger or more domain-specific dataset; the fine-tuning procedure is not very different from that of Chinese-BERT-wwm. You just need to replace `DataCollactor` in `CustomBertModel.py` with `DataCollatorForMultiMask`.
## Citation
If you are interested in the specific implementation of CNMBert, you can refer to:
```bibtex
@misc{feng2024cnmbertmodelhanyupinyin,
      title={CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task},
      author={Zishuo Feng and Feng Cao},
      year={2024},
      eprint={2411.11770},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.11770},
}
```
## License
This project is licensed under the AGPL-3.0 license.