# 🚀 Chinese BART-Base

An implementation of Chinese BART-Base for text2text generation tasks.

## 🚀 Quick Start

To get started with the Chinese BART-Base model, follow the usage example below.

## ✨ Features

### News

**12/30/2022**
An updated version of CPT and Chinese BART has been released. In the new version, the following parts are changed:

- **Vocabulary:** The old BERT vocabulary is replaced with a larger one of size 51271 built from the training data. In this process, 1) over 6800 missing Chinese characters (most of them traditional Chinese characters) are added; 2) redundant tokens (e.g., Chinese character tokens with the ## prefix) are removed; 3) some English tokens are added to reduce OOV.
- **Position embeddings:** The max_position_embeddings is extended from 512 to 1024.

The new version of the models is initialized from the old checkpoints with vocabulary alignment: token embeddings found in the old checkpoints are copied, and other newly added parameters are randomly initialized. The new CPT and Chinese BART are further trained for 50K steps with a batch size of 2048, a max sequence length of 1024, a peak learning rate of 2e-5, and a warmup ratio of 0.1.
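The vocabulary-alignment step described above can be sketched as follows. This is a toy NumPy illustration, not the real procedure: the vocabularies and embedding sizes below are made up (the actual new vocabulary has 51271 tokens).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies: the new one adds a traditional Chinese character.
old_vocab = {"[PAD]": 0, "北": 1, "京": 2}
new_vocab = {"[PAD]": 0, "北": 1, "京": 2, "龍": 3}

dim = 4
old_emb = rng.normal(size=(len(old_vocab), dim))  # stands in for the old checkpoint's embeddings

# New embedding table: rows for newly added tokens are randomly
# initialized; rows for tokens shared with the old vocabulary are
# copied over from the old checkpoint.
new_emb = rng.normal(scale=0.02, size=(len(new_vocab), dim))
for token, new_id in new_vocab.items():
    if token in old_vocab:
        new_emb[new_id] = old_emb[old_vocab[token]]

print(new_emb.shape)  # (4, 4)
```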
The results compared to the previous checkpoints are as follows:
|              | AFQMC | IFLYTEK | CSL-sum | LCSTS | AVG   |
|--------------|-------|---------|---------|-------|-------|
| **Previous** |       |         |         |       |       |
| bart-base    | 73.0  | 60.0    | 62.1    | 37.8  | 58.23 |
| cpt-base     | 75.1  | 60.5    | 63.0    | 38.2  | 59.20 |
| bart-large   | 75.7  | 62.1    | 64.2    | 40.6  | 60.65 |
| cpt-large    | 75.9  | 61.8    | 63.7    | 42.0  | 60.85 |
| **Updated**  |       |         |         |       |       |
| bart-base    | 73.03 | 61.25   | 61.51   | 38.78 | 58.64 |
| cpt-base     | 74.40 | 61.23   | 62.09   | 38.81 | 59.13 |
| bart-large   | 75.81 | 61.52   | 64.62   | 40.90 | 60.71 |
| cpt-large    | 75.97 | 61.63   | 63.83   | 42.08 | 60.88 |
The results show that the updated models maintain comparable performance to the previous checkpoints. There are still some cases where an updated model is slightly worse than the previous one, for the following reasons: 1) training for a few additional steps did not lead to significant performance improvement; 2) some downstream tasks are not affected by the newly added tokens and longer encoding sequences, but are sensitive to the fine-tuning hyperparameters.
## ⚠️ Important Note

To use the updated models, please update `modeling_cpt.py` (new version download here) and the vocabulary (refresh the cache).
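One way to refresh the cached vocabulary is to delete the model's entry from the local Hugging Face cache, so that the next `from_pretrained()` call downloads the updated files. A sketch assuming the default cache layout (`models--<org>--<name>` directories under `~/.cache/huggingface/hub`; adjust the path if you have customized `HF_HOME`):

```shell
# Assumption: default Hugging Face cache location (override with HF_HOME).
CACHE_DIR="${HF_HOME:-$HOME/.cache/huggingface}/hub"

# Remove the stale cached entry; the updated model files and vocabulary
# will be fetched on the next from_pretrained() call.
rm -rf "$CACHE_DIR/models--fnlp--bart-base-chinese"
```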
## 📚 Documentation

### Model description

This is an implementation of Chinese BART-Base.

**CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation**

Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xipeng Qiu

GitHub Link: https://github.com/fastnlp/CPT
### Usage

```python
>>> from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
>>> model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")
>>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
>>> text2text_generator("北京是[MASK]的首都", max_length=50, do_sample=False)
[{'generated_text': '北 京 是 中 国 的 首 都'}]
```
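The generated characters are separated by spaces because the character-level vocabulary is decoded token by token. A minimal post-processing sketch in pure Python (the `outputs` value below is hypothetical, copying the pipeline output format shown above):

```python
# Hypothetical pipeline output, in the format shown above.
outputs = [{"generated_text": "北 京 是 中 国 的 首 都"}]

# Written Chinese does not use spaces, so the decode-time separators
# between characters can simply be removed.
cleaned = [item["generated_text"].replace(" ", "") for item in outputs]
print(cleaned[0])  # → 北京是中国的首都
```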
## ⚠️ Important Note

Please use `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
### Citation

```bibtex
@article{shao2021cpt,
  title={CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation},
  author={Yunfan Shao and Zhichao Geng and Yitao Liu and Junqi Dai and Fei Yang and Li Zhe and Hujun Bao and Xipeng Qiu},
  journal={arXiv preprint arXiv:2109.05729},
  year={2021}
}
```