Open-source roberta-small-word-chinese-cluecorpussmall model - Pretrained at the Chinese word level, delivering better task performance

Roberta Small Word Chinese Cluecorpussmall

Developed by uer

A Chinese word-level RoBERTa medium model pretrained on CLUECorpusSmall, outperforming character-level models in multiple tasks

Large Language Model Chinese#Word-level Pretraining #Short Sequence Optimization #Enhanced Chinese NLP

Downloads 33

Release Time : 3/2/2022

Model Overview

This model is a Chinese word-level RoBERTa pretrained model with a medium-scale architecture (8 layers/512 hidden dimensions), trained on CLUECorpusSmall corpus, suitable for various Chinese natural language processing tasks.

Model Features

Word-level Tokenization Advantage

Compared to character-level models, word-level processing results in shorter sequences, faster speed, and better performance in multiple tasks

Multiple Size Options

Offers 5 different scales of pretrained models from Tiny to Base

Open Training Process

Uses publicly available corpus and tokenization tools, with complete training details provided for reproducibility

Model Capabilities

Chinese Text Understanding

Masked Word Prediction

Text Feature Extraction

Downstream Task Fine-tuning

Use Cases

Text Classification

Sentiment Analysis

Used for determining sentiment tendencies in product reviews or social media texts

Achieved 95.1% accuracy in Chinese sentiment analysis tasks

News Classification

Automatically categorizing news articles by topic

Achieved 67.8% accuracy in CLUE news classification tasks

Text Matching

QA Systems

Determining relevance between questions and candidate answers

Achieved 88.0% accuracy in text matching tasks

🚀 Chinese word-based RoBERTa Miniatures

This project offers a set of 5 Chinese word-based RoBERTa models, which are faster and perform better than character-based models according to experimental results.

🚀 Quick Start

You can directly use these models with a pipeline for masked language modeling. For example, take the word-based RoBERTa-Medium:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='uer/roberta-medium-word-chinese-cluecorpussmall')
>>> unmasker("[MASK]的首都是北京。")
[
    {'sequence': '中国 的首都是北京。',
     'score': 0.21525809168815613, 
     'token': 2873, 
     'token_str': '中国'}, 
    {'sequence': '北京 的首都是北京。', 
     'score': 0.15194718539714813, 
     'token': 9502, 
     'token_str': '北京'}, 
    {'sequence': '我们 的首都是北京。', 
     'score': 0.08854265511035919, 
     'token': 4215, 
     'token_str': '我们'},
    {'sequence': '美国 的首都是北京。', 
     'score': 0.06808705627918243, 
     'token': 7810, 
     'token_str': '美国'}, 
    {'sequence': '日本 的首都是北京。', 
     'score': 0.06071401759982109, 
     'token': 7788, 
     'token_str': '日本'}
]

✨ Features

Word-based Advantage: Most Chinese pre-trained weights are character-based. In contrast, word-based models are faster due to shorter sequence lengths and perform better according to experimental results.
Multiple Sizes: Five Chinese word-based RoBERTa models of different sizes are released to meet various needs.
Reproducibility: Publicly available corpus and word segmentation tools are used, and all training details are provided to facilitate users in reproducing the results.

💻 Usage Examples

Basic Usage

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AlbertTokenizer, BertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = BertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

Advanced Usage

Here is the usage in TensorFlow:

from transformers import AlbertTokenizer, TFBertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = TFBertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Since BertTokenizer does not support sentencepiece, AlbertTokenizer is used here.

📚 Documentation

Model description

This is the set of 5 Chinese word-based RoBERTa models pre-trained by UER-py, which is introduced in this paper. Besides, the models could also be pre-trained by TencentPretrain introduced in this paper, which inherits UER-py to support models with parameters above one billion, and extends it to a multimodal pre-training framework.

You can download the 5 Chinese RoBERTa miniatures either from the UER-py Modelzoo page, or via HuggingFace from the links below:

	Link
word-based RoBERTa-Tiny	L=2/H=128 (Tiny)
word-based RoBERTa-Mini	L=4/H=256 (Mini)
word-based RoBERTa-Small	L=4/H=512 (Small)
word-based RoBERTa-Medium	L=8/H=512 (Medium)
word-based RoBERTa-Base	L=12/H=768 (Base)

Compared with char-based models, word-based models achieve better results in most cases. Here are scores on the devlopment set of six Chinese tasks:

Model	Score	book_review	chnsenticorp	lcqmc	tnews(CLUE)	iflytek(CLUE)	ocnli(CLUE)
RoBERTa-Tiny(char)	72.3	83.4	91.4	81.8	62.0	55.0	60.3
RoBERTa-Tiny(word)	74.4(+2.1)	86.7	93.2	82.0	66.4	58.2	59.6
RoBERTa-Mini(char)	75.9	85.7	93.7	86.1	63.9	58.3	67.4
RoBERTa-Mini(word)	76.9(+1.0)	88.5	94.1	85.4	66.9	59.2	67.3
RoBERTa-Small(char)	76.9	87.5	93.4	86.5	65.1	59.4	69.7
RoBERTa-Small(word)	78.4(+1.5)	89.7	94.7	87.4	67.6	60.9	69.8
RoBERTa-Medium(char)	78.0	88.7	94.8	88.1	65.6	59.5	71.2
RoBERTa-Medium(word)	79.1(+1.1)	90.0	95.1	88.0	67.8	60.6	73.0
RoBERTa-Base(char)	79.7	90.1	95.2	89.2	67.0	60.9	75.5
RoBERTa-Base(word)	80.4(+0.7)	91.1	95.7	89.4	68.0	61.5	76.8

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained with the sequence length of 128:

epochs: 3, 5, 8
batch sizes: 32, 64
learning rates: 3e-5, 1e-4, 3e-4

Training data

CLUECorpusSmall is used as training data. Google's sentencepiece is used for word segmentation. The sentencepiece model is trained on CLUECorpusSmall corpus:

>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.train(input='cluecorpussmall.txt',
             model_prefix='cluecorpussmall_spm',
             vocab_size=100000,
             max_sentence_length=1024,
             max_sentencepiece_length=6,
             user_defined_symbols=['[MASK]','[unused1]','[unused2]',
                '[unused3]','[unused4]','[unused5]','[unused6]',
                '[unused7]','[unused8]','[unused9]','[unused10]'],
             pad_id=0,
             pad_piece='[PAD]',
             unk_id=1,
             unk_piece='[UNK]',
             bos_id=2,
             bos_piece='[CLS]',
             eos_id=3,
             eos_piece='[SEP]',
             train_extremely_large_corpus=True
            )

Training procedure

Models are pre-trained by UER-py on Tencent Cloud. We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.

Taking the case of word-based RoBERTa-Medium

Stage1:

python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --data_processor mlm

python3 pretrain.py --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --data_processor mlm --target mlm

Stage2:

python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --data_processor mlm

python3 pretrain.py --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --pretrained_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin-1000000 \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --data_processor mlm --target mlm

Finally, we convert the pre-trained model into Huggingface's format:

python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 8 --type mlm

BibTeX entry and citation info

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2 },
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご