gpt2-chinese-couplet: Open-Source Couplet Generation Model - Free Generation of Chinese Couplets in Traditional Format

Gpt2 Chinese Couplet

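Developed by uer

A GPT2-based Chinese couplet generation model, pre-trained with the UER-py framework, that generates Chinese text in the traditional couplet format.

Text Generation, Chinese #Couplet generation #Classical Chinese poetry and prose #GPT2 fine-tuning

Downloads: 491

Released: 3/2/2022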

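Model Overview

The model is built specifically for Chinese couplets: given the first line (上聯), it generates a second line (下聯) that aims for strict parallelism and a matching mood.

Model Features

Specialized couplet generation

Optimized for the Chinese couplet task, so that generated lines follow the traditional requirements of parallel structure.

Trained on large-scale data

Trained on 700,000 Chinese couplets covering a wide range of themes and styles.

Multi-framework support

Can be pre-trained with either UER-py or TencentPretrain, which makes it easier to use in different environments.

Model Capabilities

Chinese couplet generation

Automatic text completion

Composing well-matched second lines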

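Use Cases

Cultural creation

Spring Festival couplets

Automatically generate auspicious couplets for the Spring Festival and other traditional holidays.

Generated example: '丹楓江冷人初去 - 黃葉聲從天外來閱旗'

Literary writing aid

Help poets and literature enthusiasts compose well-matched parallel lines.

Education

Teaching traditional culture

Demonstrate the rules and techniques of couplet writing in Chinese language classes.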

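🚀 Chinese Couplet GPT2 Model

This model generates Chinese couplets. Using pre-training, it produces a suitable second line for a given first line, which makes couplet writing more convenient.

🚀 Quick Start

You can use the model directly with a text generation pipeline:

When the parameter skip_special_tokens is True: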

>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
>>> text_generator = TextGenerationPipeline(model, tokenizer)   
>>> text_generator("[CLS]丹 楓 江 冷 人 初 去 -", max_length=25, do_sample=True)
    [{'generated_text': '[CLS]丹 楓 江 冷 人 初 去 - 黃 葉 聲 從 天 外 來 閱 旗'}]

When the parameter skip_special_tokens is False:

>>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-couplet")
>>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-couplet")
>>> text_generator = TextGenerationPipeline(model, tokenizer)   
>>> text_generator("[CLS]丹 楓 江 冷 人 初 去 -", max_length=25, do_sample=True)
    [{'generated_text': '[CLS]丹 楓 江 冷 人 初 去 - 黃 葉 聲 我 酒 不 辭 [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]'}]
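
Both calls above pass a prompt of the same shape: `[CLS]`, then the characters of the first line separated by spaces, then ` -`. As a minimal sketch, the helper below builds that prompt from an unspaced first line. The name `build_prompt` is hypothetical, and the format is inferred from the examples above rather than from a documented specification.

```python
def build_prompt(first_line: str) -> str:
    """Return the first line in the prompt shape used by the examples above."""
    # Drop any whitespace so each remaining character becomes one spaced token.
    chars = [ch for ch in first_line if not ch.isspace()]
    if not chars:
        raise ValueError("first_line must contain at least one character")
    # "[CLS]" + space-separated characters + " -" (the separator before the second line).
    return "[CLS]" + " ".join(chars) + " -"


prompt = build_prompt("丹楓江冷人初去")
print(prompt)
```

The resulting string can be passed to `text_generator` in place of the hand-written prompt.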

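✨ Main Features

The model is pre-trained with UER-py, which is introduced in the UER paper (Zhao et al., 2019). It can also be pre-trained with TencentPretrain, introduced in the TencentPretrain paper (Zhao et al., 2023). TencentPretrain inherits UER-py, supports models with more than one billion parameters, and extends it into a multimodal pre-training framework.
Because pipelines.py uses the parameter skip_special_tokens, special tokens such as [SEP] and [UNK] are removed, so the output of the hosted inference API (on the right) may not be displayed properly.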
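
When special tokens are kept in the output, they can be stripped after generation instead. The sketch below is a plain-Python post-processing step, not part of transformers. `split_couplet` and `SPECIAL_TOKENS` are hypothetical names, and the code assumes that the first `[SEP]` marks the end of the generated second line and that `-` separates the two lines, as in the quick-start outputs.

```python
from typing import Tuple

# Assumption: the usual BERT-style special tokens; adjust to the actual vocabulary.
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]")


def split_couplet(generated_text: str) -> Tuple[str, str]:
    """Split raw generated text into (first_line, second_line) without special tokens."""
    # Assumption: everything after the first [SEP] is padding or unrelated text.
    text = generated_text.split("[SEP]", 1)[0]
    # Remove remaining special tokens (note: a dropped [UNK] loses one character).
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    # The "-" separator divides the first line from the second line.
    first, _, second = text.partition("-")
    # Remove the spaces that separate individual characters.
    return "".join(first.split()), "".join(second.split())


raw = "[CLS]丹 楓 江 冷 人 初 去 - 黃 葉 聲 我 酒 不 辭 [SEP] [SEP] [SEP]"
first_line, second_line = split_couplet(raw)
print(first_line, second_line)
```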

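📦 Installation Guide

Model download

You can download the model from any of the following:

UER-py Modelzoo page
GPT2-Chinese GitHub page
HuggingFace, via the link gpt2-chinese-couplet

Training procedure

1. Data preprocessing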

python3 preprocess.py --corpus_path corpora/couplet.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path couplet_dataset.pt --processes_num 16 \
                      --seq_length 64 --data_processor lm

2. Pre-training the model

python3 pretrain.py --dataset_path couplet_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/gpt2/config.json \
                    --output_model_path models/couplet_gpt2_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 25000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64
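
As a rough sanity check on the pre-training command above, the following sketch works out how many checkpoints the run should write and how many sequences it processes. The checkpoint names follow the `-<step>` suffix that the conversion command uses (`couplet_gpt2_model.bin-25000`). The sequence count treats `--batch_size` as a per-GPU value, which is an assumption about UER-py's behavior and should be checked against its documentation.

```python
# Values copied from the pre-training command above.
total_steps = 25_000
save_checkpoint_steps = 5_000
batch_size = 64
world_size = 8
output_model_path = "models/couplet_gpt2_model.bin"

# Assumption: --batch_size is per GPU, so the global batch is batch_size * world_size.
effective_batch = batch_size * world_size
sequences_seen = effective_batch * total_steps

# One checkpoint every save_checkpoint_steps steps, named "<output_model_path>-<step>".
checkpoints = [
    f"{output_model_path}-{step}"
    for step in range(save_checkpoint_steps, total_steps + 1, save_checkpoint_steps)
]

print(f"effective batch size: {effective_batch}")
print(f"sequences processed:  {sequences_seen:,}")
for path in checkpoints:
    print(path)
```

Under that assumption the run processes 12.8 million sequences of up to 64 tokens, and the last checkpoint is the one passed to the conversion script.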

3. Converting to Huggingface format

python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path models/couplet_gpt2_model.bin-25000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 12

📚 Detailed Documentation

Training data

The training data contains 700,000 Chinese couplets collected by couplet-clean-dataset.

Citation

@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}
}

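🔧 Technical Details

The model was pre-trained with UER-py (TencentPretrain can be used as well) on Tencent Cloud. By learning from a large set of Chinese couplets, it picks up the language patterns and parallelism rules of the form and can then generate couplets. Training used a sequence length of 64, 25,000 steps, a learning rate of 5e-4, and a batch size of 64, as given in the training commands. At inference time, the setting of the parameter skip_special_tokens determines whether special tokens appear in the output.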