CPT-Base Open-Source Model - Free Support for Chinese Comprehension and Content Generation Tasks

CPT-Base

Developed by fnlp

Asymmetric pre-trained Transformer model for Chinese comprehension and generation tasks

Large Language Model

Transformers

Chinese #Asymmetric Pre-training #Chinese Generation Optimization #Long Sequence Processing

Downloads 37

Release Time: 3/2/2022

Model Overview

CPT is a pre-trained model specifically designed for Chinese text processing, supporting various tasks such as text generation, classification, and summarization, with an optimized asymmetric Transformer architecture for enhanced Chinese processing.

Model Features

Optimized Chinese Vocabulary

Includes 51,271 lexical items, supplements 6,800+ missing Chinese characters, and removes redundant tokens, significantly reducing the out-of-vocabulary rate.

Long Sequence Support

Positional encoding extended to 1024 tokens, enhancing long-text processing capability.

Asymmetric Architecture

Encoder-decoder structure specifically optimized for Chinese comprehension and generation tasks.
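
The 1024-position limit mentioned above bounds how much text fits into a single input. The following is a minimal sketch of splitting a long Chinese document into pieces that fit, assuming roughly one token per Chinese character and two reserved positions for [CLS] and [SEP]. The helper name `split_for_model` is hypothetical and is not part of CPT or transformers; text containing English words or digits may tokenize differently, so a real pipeline should count tokens with the tokenizer itself.

```python
MAX_POSITIONS = 1024  # position limit stated in the model description
NUM_SPECIAL = 2       # assumption: one [CLS] and one [SEP] per input


def split_for_model(text, max_positions=MAX_POSITIONS, num_special=NUM_SPECIAL):
    """Split text into consecutive chunks that leave room for special tokens.

    Assumes one token per character, which is only an approximation.
    """
    budget = max_positions - num_special
    return [text[i:i + budget] for i in range(0, len(text), budget)]


doc = "北京是中国的首都。" * 300  # 9 characters repeated 300 times = 2700 characters
chunks = split_for_model(doc)
print([len(c) for c in chunks])  # [1022, 1022, 656]
```

Splitting on fixed character offsets can cut a sentence in half; splitting on sentence boundaries first and then packing sentences up to the budget is usually a better choice for summarization inputs.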

Model Capabilities

Chinese Text Generation

Text Classification

Summarization

Masked Language Modeling

Sequence-to-Sequence Tasks

Use Cases

Text Generation

Automatic Summarization

Generates concise summaries from long texts

Achieves 38.81 ROUGE-L score on LCSTS dataset

Text Comprehension

Semantic Matching

Determines semantic relevance between sentence pairs

Achieves 74.4% accuracy on AFQMC task

🚀 Chinese CPT-Base

This is a pre-trained model for Chinese language understanding and generation, offering high-performance solutions for various NLP tasks.

🚀 Quick Start

To start using the Chinese CPT-Base model, import modeling_cpt.py into your project, then initialize the model and tokenizer. Here is a simple example:

>>> from modeling_cpt import CPTForConditionalGeneration
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("fnlp/cpt-base")
>>> model = CPTForConditionalGeneration.from_pretrained("fnlp/cpt-base")
>>> input_ids = tokenizer.encode("北京是[MASK]的首都", return_tensors='pt')
>>> pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
>>> print(tokenizer.convert_ids_to_tokens(pred_ids[0]))
    ['[SEP]', '[CLS]', '北', '京', '是', '中', '国', '的', '首', '都', '[SEP]']

⚠️ Important Note

Please use BertTokenizer for the model vocabulary. Do not use the original BartTokenizer.
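
The token list printed by the Quick Start example still contains special tokens. The following is a minimal sketch of turning such a list back into a plain string. The helper `tokens_to_text` and the `SPECIAL_TOKENS` set are hypothetical names introduced here for illustration, and the set is an assumption based on the usual BERT-style special tokens.

```python
# Assumption: the usual BERT-style special tokens; adjust to the actual vocabulary.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"}


def tokens_to_text(tokens):
    """Drop special tokens and join the remaining pieces into one string."""
    pieces = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        # WordPiece continuation pieces carry a "##" prefix; strip it before joining.
        pieces.append(tok[2:] if tok.startswith("##") else tok)
    return "".join(pieces)


generated = ['[SEP]', '[CLS]', '北', '京', '是', '中', '国', '的', '首', '都', '[SEP]']
print(tokens_to_text(generated))  # 北京是中国的首都
```

Joining without separators suits Chinese text; output that mixes in English words would need spaces restored between whole-word tokens.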

✨ Features

News

12/30/2022

Updated versions of CPT & Chinese BART have been released. The new versions change the following parts:

Vocabulary: We replace the old BERT vocabulary with a larger one of size 51271 built from the training data. In this new vocabulary, we 1) add over 6800 missing Chinese characters (most of them are traditional Chinese characters); 2) remove redundant tokens (e.g., Chinese character tokens with ## prefix); 3) add some English tokens to reduce OOV.
Position Embeddings: We extend the max_position_embeddings from 512 to 1024.
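
To illustrate step 2) of the vocabulary change, here is a minimal sketch of dropping "##"-prefixed tokens whose body is a single Chinese character. It is not the authors' script. The function names are hypothetical, and the character test only covers the main CJK Unified Ideographs block, which is a simplifying assumption.

```python
def is_cjk_char(ch):
    """True for characters in the main CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def drop_redundant_tokens(vocab):
    """Remove '##'-prefixed tokens whose remainder is a single Chinese character."""
    kept = []
    for tok in vocab:
        if tok.startswith("##") and len(tok) == 3 and is_cjk_char(tok[2]):
            continue  # e.g. "##北": treated as redundant in this sketch
        kept.append(tok)
    return kept


old_vocab = ["[CLS]", "[SEP]", "北", "##北", "京", "##京", "play", "##ing"]
print(drop_redundant_tokens(old_vocab))
# ['[CLS]', '[SEP]', '北', '京', 'play', '##ing']
```

Such tokens can be considered redundant because a BERT-style tokenizer typically isolates each Chinese character before WordPiece splitting, so the "##" variants are never produced, while English continuation pieces such as "##ing" remain useful.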

We initialize the new models from the old checkpoints with vocabulary alignment. Token embeddings found in the old checkpoints are copied, and other newly added parameters are randomly initialized. We further train the new CPT & Chinese BART for 50K steps with batch size 2048, max-seq-length 1024, peak learning rate 2e-5, and warmup ratio 0.1.
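
The vocabulary alignment described above can be sketched as follows: rows for tokens that exist in the old vocabulary are copied, and rows for new tokens are drawn at random. This is an illustrative sketch in plain Python, not the authors' code. The function name `align_embeddings`, the Gaussian initializer, and its standard deviation of 0.02 are assumptions.

```python
import random


def align_embeddings(old_vocab, old_emb, new_vocab, dim, std=0.02, seed=0):
    """Build an embedding table for new_vocab from an old table.

    Rows of tokens present in old_vocab are copied; all other rows are
    randomly initialized (assumed: Gaussian with the given std).
    Returns the new table and the number of copied rows.
    """
    rng = random.Random(seed)
    old_index = {tok: i for i, tok in enumerate(old_vocab)}
    new_emb, copied = [], 0
    for tok in new_vocab:
        if tok in old_index:
            new_emb.append(list(old_emb[old_index[tok]]))
            copied += 1
        else:
            new_emb.append([rng.gauss(0.0, std) for _ in range(dim)])
    return new_emb, copied


# Toy example with 2-dimensional embeddings.
old_vocab = ["[CLS]", "北", "##北", "京"]
old_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
new_vocab = ["[CLS]", "北", "京", "龍"]  # "龍" is new, "##北" was dropped

new_emb, copied = align_embeddings(old_vocab, old_emb, new_vocab, dim=2)
print(copied)      # 3
print(new_emb[2])  # [0.7, 0.8]
```

Matching by token string rather than by index matters here because removing and adding tokens shifts the ids of most entries between the two vocabularies.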

The results compared with the previous checkpoints are as follows:

Model	AFQMC	IFLYTEK	CSL-sum	LCSTS	AVG
Previous
bart-base	73.0	60	62.1	37.8	58.23
cpt-base	75.1	60.5	63.0	38.2	59.20
bart-large	75.7	62.1	64.2	40.6	60.65
cpt-large	75.9	61.8	63.7	42.0	60.85
Updated
bart-base	73.03	61.25	61.51	38.78	58.64
cpt-base	74.40	61.23	62.09	38.81	59.13
bart-large	75.81	61.52	64.62	40.90	60.71
cpt-large	75.97	61.63	63.83	42.08	60.88

The results show that the updated models maintain performance comparable to the previous checkpoints. There are still some cases where the updated model is slightly worse than the previous one, for the following reasons: 1) training for a few additional steps did not lead to significant performance improvement; 2) some downstream tasks are not affected by the newly added tokens and longer encoding sequences but are sensitive to the fine-tuning hyperparameters.

Note that to use the updated models, please update modeling_cpt.py (new version download Here) and the vocabulary (refresh the cache).

📚 Documentation

Model description

This is an implementation of CPT-Base. To use CPT, please import the file modeling_cpt.py (Download Here) that defines the architecture of CPT into your project.

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xipeng Qiu

Github Link: https://github.com/fastnlp/CPT

📄 License

Citation

@article{shao2021cpt,
  title={CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation}, 
  author={Yunfan Shao and Zhichao Geng and Yitao Liu and Junqi Dai and Fei Yang and Li Zhe and Hujun Bao and Xipeng Qiu},
  journal={arXiv preprint arXiv:2109.05729},
  year={2021}
}
