🚀 Chinese CPT-Large
This project provides an implementation of CPT-Large, a pre-trained model for Chinese language understanding and generation. It has been updated with an improved vocabulary and extended position embeddings.
🚀 Quick Start
News
12/30/2022
An updated version of CPT & Chinese BART has been released. In the new version, the following parts have been changed:
- Vocabulary: We replaced the old BERT vocabulary with a larger one of size 51271 built from the training data. In this new vocabulary, we 1) added over 6800 missing Chinese characters (most of them are traditional Chinese characters); 2) removed redundant tokens (e.g., Chinese character tokens with the ## prefix); 3) added some English tokens to reduce OOV.
- Position Embeddings: We extended the max_position_embeddings from 512 to 1024 (see the verification sketch after this list).
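The snippet below is a minimal sketch (not part of the official release notes) for checking both changes on the updated checkpoint; it assumes the updated files are published under `fnlp/cpt-large` and that the checkpoint's config can be read with `AutoConfig`.

```python
# Minimal sketch: verify the new vocabulary size and extended position embeddings.
# Assumes the updated checkpoint is published under "fnlp/cpt-large".
from transformers import AutoConfig, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large")
config = AutoConfig.from_pretrained("fnlp/cpt-large")

print(len(tokenizer))                  # expected: 51271 with the new vocabulary
print(config.max_position_embeddings)  # expected: 1024 after the extension
```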
We initialized the new models from the old checkpoints with vocabulary alignment: token embeddings found in the old checkpoints were copied, and newly added parameters were randomly initialized. We then further trained the new CPT & Chinese BART for 50K steps with a batch size of 2048, a maximum sequence length of 1024, a peak learning rate of 2e-5, and a warmup ratio of 0.1.
The results compared to the previous checkpoints are as follows:
|              | AFQMC | IFLYTEK | CSL-sum | LCSTS | AVG   |
|--------------|-------|---------|---------|-------|-------|
| **Previous** |       |         |         |       |       |
| bart-base    | 73.0  | 60      | 62.1    | 37.8  | 58.23 |
| cpt-base     | 75.1  | 60.5    | 63.0    | 38.2  | 59.20 |
| bart-large   | 75.7  | 62.1    | 64.2    | 40.6  | 60.65 |
| cpt-large    | 75.9  | 61.8    | 63.7    | 42.0  | 60.85 |
| **Updated**  |       |         |         |       |       |
| bart-base    | 73.03 | 61.25   | 61.51   | 38.78 | 58.64 |
| cpt-base     | 74.40 | 61.23   | 62.09   | 38.81 | 59.13 |
| bart-large   | 75.81 | 61.52   | 64.62   | 40.90 | 60.71 |
| cpt-large    | 75.97 | 61.63   | 63.83   | 42.08 | 60.88 |
The results show that the updated models maintain comparable performance to the previous checkpoints. In some cases the updated model is still slightly worse than the previous one, for two reasons: 1) training for a few additional steps did not lead to significant performance improvement; 2) some downstream tasks are not affected by the newly added tokens and longer encoding sequences, but are sensitive to the fine-tuning hyperparameters.
⚠️ Important Note
To use the updated models, please update modeling_cpt.py (new version download Here) and the vocabulary (refresh the local cache).
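One hedged way to refresh the cached vocabulary is to force a re-download from the Hub, as sketched below; `force_download` simply re-fetches the files and overwrites the cached copies.

```python
# Sketch: refresh the locally cached tokenizer files so the new vocabulary is used.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large", force_download=True)
print(len(tokenizer))  # should now report the updated vocabulary size (51271)
```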
✨ Features
This model is suitable for multiple NLP tasks such as fill-mask, text2text-generation, text-classification, and summarization. It is designed specifically for the Chinese language, leveraging architectures such as CPT, BART, and BERT.
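As an illustration of the summarization use case, the sketch below follows the same pattern as the basic usage example further down; the input text and generation settings are illustrative only, not recommended values.

```python
# Illustrative sketch of abstractive summarization with CPT-Large.
from modeling_cpt import CPTForConditionalGeneration
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large")
model = CPTForConditionalGeneration.from_pretrained("fnlp/cpt-large")

# Example Chinese input to summarize (illustrative only).
text = "复旦大学的研究人员发布了预训练模型CPT，用于中文自然语言理解与生成任务。"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```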
📚 Documentation
Model description
This is an implementation of CPT-Large. To use CPT, please import the file modeling_cpt.py (Download Here), which defines the CPT architecture, into your project.
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xipeng Qiu
Github Link: https://github.com/fastnlp/CPT
💻 Usage Examples
Basic Usage
```python
>>> from modeling_cpt import CPTForConditionalGeneration
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large")
>>> model = CPTForConditionalGeneration.from_pretrained("fnlp/cpt-large")
>>> input_ids = tokenizer.encode("北京是[MASK]的首都", return_tensors='pt')
>>> pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
>>> print(tokenizer.convert_ids_to_tokens(pred_ids[0]))
['[SEP]', '[CLS]', '北', '京', '是', '中', '国', '的', '首', '都', '[SEP]']
```
⚠️ Important Note
Please use BertTokenizer for the model vocabulary. DO NOT use the original BartTokenizer.
📄 License
Citation
```bibtex
@article{shao2021cpt,
  title={CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation},
  author={Yunfan Shao and Zhichao Geng and Yitao Liu and Junqi Dai and Fei Yang and Li Zhe and Hujun Bao and Xipeng Qiu},
  journal={arXiv preprint arXiv:2109.05729},
  year={2021}
}
```