🚀 Chinese CPT-Large
This project provides an implementation of CPT-Large, a pre-trained model for Chinese language understanding and generation. It has been updated with an improved vocabulary and extended position embeddings.
🚀 Quick Start
News
12/30/2022
An updated version of CPT & Chinese BART has been released. In the new version, the following parts have been changed:
- Vocabulary: We replaced the old BERT vocabulary with a larger one of size 51271 built from the training data. In this new vocabulary, we 1) added over 6800 missing Chinese characters (most of them are traditional Chinese characters); 2) removed redundant tokens (e.g., Chinese character tokens with the ## prefix); 3) added some English tokens to reduce OOV.
- Position Embeddings: We extended the max_position_embeddings from 512 to 1024 (see the verification sketch after this list).
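The snippet below is a minimal sketch (not part of the official release notes) for checking both changes on the updated checkpoint; it assumes the updated files are published under `fnlp/cpt-large` and that the checkpoint's config can be read with `AutoConfig`.

```python
# Minimal sketch: verify the new vocabulary size and extended position embeddings.
# Assumes the updated checkpoint is published under "fnlp/cpt-large".
from transformers import AutoConfig, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large")
config = AutoConfig.from_pretrained("fnlp/cpt-large")

print(len(tokenizer))                  # expected: 51271 with the new vocabulary
print(config.max_position_embeddings)  # expected: 1024 after the extension
```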
We initialized the new models from the old checkpoints with vocabulary alignment: token embeddings found in the old checkpoints were copied, and newly added parameters were randomly initialized. We then further trained the new CPT & Chinese BART for 50K steps with a batch size of 2048, a maximum sequence length of 1024, a peak learning rate of 2e-5, and a warmup ratio of 0.1.
The results compared to the previous checkpoints are as follows:
|              | AFQMC | IFLYTEK | CSL-sum | LCSTS | AVG   |
|--------------|-------|---------|---------|-------|-------|
| **Previous** |       |         |         |       |       |
| bart-base    | 73.0  | 60      | 62.1    | 37.8  | 58.23 |
| cpt-base     | 75.1  | 60.5    | 63.0    | 38.2  | 59.20 |
| bart-large   | 75.7  | 62.1    | 64.2    | 40.6  | 60.65 |
| cpt-large    | 75.9  | 61.8    | 63.7    | 42.0  | 60.85 |
| **Updated**  |       |         |         |       |       |
| bart-base    | 73.03 | 61.25   | 61.51   | 38.78 | 58.64 |
| cpt-base     | 74.40 | 61.23   | 62.09   | 38.81 | 59.13 |
| bart-large   | 75.81 | 61.52   | 64.62   | 40.90 | 60.71 |
| cpt-large    | 75.97 | 61.63   | 63.83   | 42.08 | 60.88 |
The results show that the updated models maintain comparable performance to the previous checkpoints. In some cases the updated model is still slightly worse than the previous one, for two reasons: 1) training for a few additional steps did not lead to significant performance improvement; 2) some downstream tasks are not affected by the newly added tokens and longer encoding sequences, but are sensitive to the fine-tuning hyperparameters.
⚠️ Important Note
To use the updated models, please update modeling_cpt.py (new version download Here) and the vocabulary (refresh the local cache).
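One hedged way to refresh the cached vocabulary is to force a re-download from the Hub, as sketched below; `force_download` simply re-fetches the files and overwrites the cached copies.

```python
# Sketch: refresh the locally cached tokenizer files so the new vocabulary is used.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large", force_download=True)
print(len(tokenizer))  # should now report the updated vocabulary size (51271)
```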
✨ Features
This model is suitable for multiple NLP tasks such as fill-mask, text2text-generation, text-classification, and summarization. It is designed specifically for the Chinese language, leveraging architectures such as CPT, BART, and BERT.
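As an illustration of the summarization use case, the sketch below follows the same pattern as the basic usage example further down; the input text and generation settings are illustrative only, not recommended values.

```python
# Illustrative sketch of abstractive summarization with CPT-Large.
from modeling_cpt import CPTForConditionalGeneration
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large")
model = CPTForConditionalGeneration.from_pretrained("fnlp/cpt-large")

# Example Chinese input to summarize (illustrative only).
text = "复旦大学的研究人员发布了预训练模型CPT，用于中文自然语言理解与生成任务。"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```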
📚 Documentation
Model description
This is an implementation of CPT-Large. To use CPT, please import the file modeling_cpt.py (Download Here), which defines the CPT architecture, into your project.
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xipeng Qiu
Github Link: https://github.com/fastnlp/CPT
💻 Usage Examples
Basic Usage
```python
>>> from modeling_cpt import CPTForConditionalGeneration
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-large")
>>> model = CPTForConditionalGeneration.from_pretrained("fnlp/cpt-large")
>>> input_ids = tokenizer.encode("北京是[MASK]的首都", return_tensors='pt')
>>> pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
>>> print(tokenizer.convert_ids_to_tokens(pred_ids[0]))
['[SEP]', '[CLS]', '北', '京', '是', '中', '国', '的', '首', '都', '[SEP]']
```
⚠️ Important Note
Please use BertTokenizer for the model vocabulary. DO NOT use the original BartTokenizer.
📄 License
Citation
```bibtex
@article{shao2021cpt,
  title={CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation},
  author={Yunfan Shao and Zhichao Geng and Yitao Liu and Junqi Dai and Fei Yang and Li Zhe and Hujun Bao and Xipeng Qiu},
  journal={arXiv preprint arXiv:2109.05729},
  year={2021}
}
```