GPT2-Chinese-Couplet Open-Source Couplet Generation Model - Freely Generate Chinese Couplets in Traditional Format

Gpt2 Chinese Couplet

Developed by uer

A Chinese couplet generation model based on the GPT2 architecture, pre-trained with the UER-py framework, capable of generating Chinese text that conforms to traditional couplet formats.

Tags: Text Generation, Chinese, Couplet Generation, Chinese Classical Poetry, GPT2 Fine-tuning

Downloads 491

Release Time: 3/2/2022

Model Overview

This model generates Chinese couplets: given a first line, it produces a matching second line that aims to preserve parallelism and thematic harmony.

Model Features

Professional Couplet Generation

Optimized specifically for Chinese couplet tasks, capable of generating couplets that meet traditional parallelism requirements.

Trained on Large-Scale Data

Trained on a dataset of 700,000 Chinese couplets, covering a wide range of themes and styles.

Multi-Framework Support

Can be pre-trained with either the UER-py or the TencentPretrain framework.

Model Capabilities

Chinese Couplet Generation

Text Auto-Completion

Creation of Well-Paralleled Second Lines

Use Cases

Cultural Creation

Spring Festival Couplet Creation

Automatically generates auspicious couplets for traditional festivals like the Spring Festival.

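Generated example: given the first line '丹枫江冷人初去' (roughly, 'Red maples, the river cold, someone has just departed'), the model continued with '黄叶声从天外来' (roughly, 'The sound of yellow leaves comes from beyond the sky').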

Literary Creation Assistance

Assists poets or literature enthusiasts in creating well-paralleled verses.

Educational Applications

Traditional Culture Teaching

Used in Chinese language education to demonstrate the rules and techniques of couplet creation.

🚀 Chinese Couplet GPT2 Model

This is a GPT2 model for generating Chinese couplets, pre-trained with the UER-py framework.

🚀 Quick Start

The sections below cover the model description, usage, training data, and training procedure.

✨ Features

Pre-training Frameworks: The model can be pre-trained with [UER-py](https://github.com/dbiir/UER-py/) or TencentPretrain, introduced by Zhao et al. (2019) and Zhao et al. (2023), respectively.
Download Channels: You can download the model from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), the [GPT2-Chinese GitHub page](https://github.com/Morizeyao/GPT2-Chinese), or from Hugging Face at [gpt2-chinese-couplet](https://huggingface.co/uer/gpt2-chinese-couplet).

📦 Installation

No separate installation steps are documented. The usage examples below rely on the Hugging Face transformers library, and the model itself can be downloaded from the sources listed above.

💻 Usage Examples

Basic Usage

You can use the model directly with a pipeline for text generation. The output below is what the pipeline returns when it decodes with skip_special_tokens set to True:

>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
>>> text_generator = TextGenerationPipeline(model, tokenizer)   
>>> text_generator("[CLS]丹 枫 江 冷 人 初 去 -", max_length=25, do_sample=True)
    [{'generated_text': '[CLS]丹 枫 江 冷 人 初 去 - 黄 叶 声 从 天 外 来 阅 旗'}]
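
The prompt in the example above follows a fixed pattern: a leading [CLS], the first line with its characters separated by spaces, and a trailing " -". A minimal sketch of a helper that builds such a prompt from an unspaced first line follows. The name `build_prompt` is hypothetical and not part of the model card or of transformers, and the assumption that the prompt should match the example's spacing exactly comes only from that example.

```python
def build_prompt(first_line: str) -> str:
    """Format a first line like the prompt in the model card's example.

    Hypothetical helper: it only mirrors the pattern "[CLS]" + spaced
    characters + " -" seen in the example call.
    """
    # Drop any whitespace so that spaced and unspaced input behave the same.
    chars = [ch for ch in first_line if not ch.isspace()]
    if not chars:
        raise ValueError("first_line must contain at least one character")
    return "[CLS]" + " ".join(chars) + " -"


print(build_prompt("丹枫江冷人初去"))  # [CLS]丹 枫 江 冷 人 初 去 -
```

The returned string can be passed to `text_generator` in place of the hand-written prompt.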

Advanced Usage

The call itself is unchanged. When the pipeline decodes with skip_special_tokens set to False, special tokens such as [SEP] remain in the output:

>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
>>> text_generator = TextGenerationPipeline(model, tokenizer)   
>>> text_generator("[CLS]丹 枫 江 冷 人 初 去 -", max_length=25, do_sample=True)
    [{'generated_text': '[CLS]丹 枫 江 冷 人 初 去 - 黄 叶 声 我 酒 不 辞 [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]'}]
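
Both sample outputs contain more than the second line: the prompt is echoed back, and the continuation may run past the couplet or end in [SEP] tokens. The sketch below shows one way to recover just the second line. The helper name `extract_second_line` is hypothetical, and trimming the result to the length of the first line is an assumption based on the convention that both lines of a couplet have the same number of characters. It is not something the model card prescribes.

```python
def extract_second_line(generated_text: str) -> str:
    """Pull the second line out of a 'generated_text' string (hypothetical helper)."""
    # Keep only the text before the first [SEP], then drop the [CLS] marker.
    text = generated_text.split("[SEP]", 1)[0].replace("[CLS]", "")
    first, sep, second = text.partition("-")
    if not sep:
        raise ValueError("expected a '-' between the first and second line")
    # Tokens are space-separated in the output; join them back together.
    first_chars = "".join(first.split())
    second_chars = "".join(second.split())
    # Assumption: the second line has as many characters as the first.
    return second_chars[:len(first_chars)]


sample = "[CLS]丹 枫 江 冷 人 初 去 - 黄 叶 声 从 天 外 来 阅 旗"
print(extract_second_line(sample))  # 黄叶声从天外来
```

If the model emits fewer characters than the first line contains, the function returns the shorter string unchanged, so callers may want to check the length before using the result.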

📚 Documentation

Model Description

The model is pre-trained with [UER-py](https://github.com/dbiir/UER-py/) (Zhao et al., 2019). It could also be pre-trained with TencentPretrain (Zhao et al., 2023), which inherits from UER-py, supports models with more than one billion parameters, and extends it into a multimodal pre-training framework.

The model is used to generate Chinese couplets. You can download it from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), the [GPT2-Chinese GitHub page](https://github.com/Morizeyao/GPT2-Chinese), or from Hugging Face at [gpt2-chinese-couplet](https://huggingface.co/uer/gpt2-chinese-couplet).

Because pipelines.py decodes with the skip_special_tokens parameter, special tokens such as [SEP] and [UNK] are removed from the generated text. As a result, the output of the Hosted inference API may not be displayed properly.
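
To control this behaviour directly, one option is to bypass the pipeline and call `generate` and `decode` yourself. The following is a minimal sketch under that assumption, not the method used by the Hosted inference API. The function names `generate_text` and `main` are hypothetical. `generate_text` takes the model and tokenizer as arguments, and `main` (which downloads the checkpoint) is defined but not called.

```python
def generate_text(model, tokenizer, prompt, skip_special_tokens, max_length=25):
    """Sample a continuation and decode it with or without special tokens."""
    # The prompt already begins with [CLS], so the tokenizer is told not to
    # wrap it in an extra [CLS] ... [SEP] pair.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        do_sample=True,
    )
    # skip_special_tokens=True drops tokens such as [CLS], [SEP] and [UNK].
    return tokenizer.decode(output_ids[0], skip_special_tokens=skip_special_tokens)


def main():
    # Requires the transformers and torch packages and network access.
    from transformers import BertTokenizer, GPT2LMHeadModel

    tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
    prompt = "[CLS]丹 枫 江 冷 人 初 去 -"
    for skip in (True, False):
        print(skip, generate_text(model, tokenizer, prompt, skip))
```

Calling `main()` prints one sampled continuation per setting. Because sampling is random, the two lines will generally differ in content as well as in whether special tokens appear.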

Training Data

The training data consists of 700,000 Chinese couplets collected by [couplet-clean-dataset](https://github.com/v-zich/couplet-clean-dataset).

Training Procedure

The model is pre-trained with [UER-py](https://github.com/dbiir/UER-py/) on Tencent Cloud. We pre-train for 25,000 steps with a sequence length of 64.

python3 preprocess.py --corpus_path corpora/couplet.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path couplet_dataset.pt --processes_num 16 \
                      --seq_length 64 --data_processor lm

python3 pretrain.py --dataset_path couplet_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/gpt2/config.json \
                    --output_model_path models/couplet_gpt2_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 25000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64

Finally, we convert the pre-trained model into Hugging Face's format:

python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path models/couplet_gpt2_model.bin-25000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 12
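
As a rough sense of scale, the flags in the pre-training command above can be multiplied out. The sketch assumes that `--batch_size` is the per-GPU batch size, so that the effective batch is `batch_size × world_size`. That reading is an assumption and is not stated in the text. The checkpoint names follow the `.bin-25000` suffix used in the conversion command.

```python
# Values copied from the preprocess.py / pretrain.py commands.
batch_size = 64              # --batch_size (assumed to be per GPU)
world_size = 8               # --world_size
total_steps = 25_000         # --total_steps
seq_length = 64              # --seq_length
save_checkpoint_steps = 5_000  # --save_checkpoint_steps

# Sequences consumed per optimizer step under the per-GPU assumption.
sequences_per_step = batch_size * world_size
total_sequences = sequences_per_step * total_steps
# Upper bound on token positions seen, counting every position in a sequence.
total_token_positions = total_sequences * seq_length

# Checkpoints expected along the way, using the "<path>-<step>" naming
# that the conversion command's input path suggests.
checkpoint_paths = [
    f"models/couplet_gpt2_model.bin-{step}"
    for step in range(save_checkpoint_steps, total_steps + 1, save_checkpoint_steps)
]

print(sequences_per_step)     # 512
print(total_sequences)        # 12800000
print(total_token_positions)  # 819200000
print(checkpoint_paths[-1])   # models/couplet_gpt2_model.bin-25000
```

The last path matches the `--input_model_path` passed to the conversion script, which is the checkpoint written at the final step.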

BibTeX entry and citation info

@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}
}

Property	Details
Model Type	A model for generating Chinese couplets, pre-trained by UER-py or TencentPretrain
Training Data	700,000 Chinese couplets collected by [couplet-clean-dataset](https://github.com/v-zich/couplet-clean-dataset)
