roberta-base-word-chinese-cluecorpussmall Open-source Chinese Model - Word Segmentation Processing Improves Text Sequence Processing Efficiency

Roberta Base Word Chinese Cluecorpussmall

Developed by uer

A word-based (word-segmented) Chinese version of the RoBERTa Base model, pre-trained on the CLUECorpusSmall corpus. Word segmentation shortens input sequences and speeds up processing.

Large Language Model Chinese #Tokenization Optimization #Chinese Pre-training #Multiple Sizes Available

Downloads 184

Release Time: 3/2/2022

Model Overview

This model is a word-based version of the Chinese RoBERTa pre-trained model. Compared with character-level models, it is faster and generally performs better, and it is suited to Chinese natural language processing tasks.

Model Features

Tokenization Optimization

Uses sentencepiece word segmentation, which shortens input sequences and speeds up processing relative to character-level models

Multiple Sizes Available

Offers five different pre-trained model sizes ranging from Tiny to Base

Public Corpus

Trained on the publicly available CLUECorpusSmall corpus, which makes the results easier to reproduce

Model Capabilities

Text Feature Extraction

Masked Language Prediction

Chinese Text Understanding

Use Cases

Text Completion

Transportation Information Completion

Completing queries like 'What time does the [MASK] to Beijing depart?'

Can suggest plausible transportation modes such as 'flight' or 'high-speed rail'

Text Feature Extraction

Document Vectorization

Obtaining deep semantic representations of Chinese texts

Can be used for downstream classification, clustering, and other tasks

🚀 Chinese word-based RoBERTa Miniatures

This project offers a set of 5 Chinese word-based RoBERTa models. These pre-trained models outperform character-based models on most of the Chinese language tasks evaluated. They are trained on publicly available data, and all training details are provided for easy reproduction.

🚀 Quick Start

The models can be used directly through the HuggingFace pipeline API. The example below uses the word-based RoBERTa-Medium model for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='uer/roberta-medium-word-chinese-cluecorpussmall')
>>> unmasker("[MASK]的首都是北京。")
[
    {'sequence': '中国 的首都是北京。',
     'score': 0.21525809168815613, 
     'token': 2873, 
     'token_str': '中国'}, 
    {'sequence': '北京 的首都是北京。', 
     'score': 0.15194718539714813, 
     'token': 9502, 
     'token_str': '北京'}, 
    {'sequence': '我们 的首都是北京。', 
     'score': 0.08854265511035919, 
     'token': 4215, 
     'token_str': '我们'},
    {'sequence': '美国 的首都是北京。', 
     'score': 0.06808705627918243, 
     'token': 7810, 
     'token_str': '美国'}, 
    {'sequence': '日本 的首都是北京。', 
     'score': 0.06071401759982109, 
     'token': 7788, 
     'token_str': '日本'}
]
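The pipeline returns a list of dictionaries, one per candidate token. The following is a minimal sketch of post-processing that output, using the first three predictions shown above as literal data so it runs without downloading the model. The helper name top_tokens and the 0.1 threshold are illustrative assumptions, not part of the original README.

```python
# Predictions copied from the example output above (first three entries).
predictions = [
    {"sequence": "中国 的首都是北京。", "score": 0.21525809168815613,
     "token": 2873, "token_str": "中国"},
    {"sequence": "北京 的首都是北京。", "score": 0.15194718539714813,
     "token": 9502, "token_str": "北京"},
    {"sequence": "我们 的首都是北京。", "score": 0.08854265511035919,
     "token": 4215, "token_str": "我们"},
]


def top_tokens(preds, min_score=0.1):
    """Return (token_str, score) pairs with score >= min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in preds if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for token, score in top_tokens(predictions):
    print(f"{token}\t{score:.3f}")
```

Raising min_score narrows the result to high-confidence completions; with the data above, a threshold of 0.5 returns an empty list.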

✨ Features

Word-based: Compared with character-based models, word-based models are faster (due to shorter sequence lengths) and generally perform better.
Multiple Sizes: There are 5 different sizes of models available, including Tiny, Mini, Small, Medium, and Base, to meet different application requirements.
Public Data: The models are trained on publicly available data, and all training details are provided, making it easier for users to reproduce the results.
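To illustrate why word-based input is shorter, the sketch below compares a character-level split with a toy greedy longest-match segmentation. The tiny vocabulary and the segment helper are hypothetical stand-ins, not the released sentencepiece model, which learns its pieces from CLUECorpusSmall. The default max_len of 6 mirrors the max_sentencepiece_length setting used when that model was trained.

```python
def segment(text, vocab, max_len=6):
    """Greedy longest-match segmentation; falls back to single characters."""
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                pieces.append(piece)
                i += size
                break
    return pieces


toy_vocab = {"中国", "首都", "北京"}  # hypothetical vocabulary, for illustration only
text = "中国的首都是北京。"

char_tokens = list(text)              # character-level: one token per character
word_tokens = segment(text, toy_vocab)

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
```

On this sentence the character-level split yields 9 tokens and the toy word-level split yields 6, which is the effect the first feature describes.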

📦 Installation

No specific installation steps are provided in the original README. To use these models, install the required libraries with pip. The usage examples below additionally require PyTorch or TensorFlow.

pip install transformers sentencepiece

💻 Usage Examples

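Basic Usage (PyTorch)

# Extract the features of a given text with PyTorch.
from transformers import AlbertTokenizer, BertModel
# AlbertTokenizer is used because the vocabulary is a sentencepiece model.
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = BertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state holds the per-token features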

Advanced Usage (TensorFlow)

# Extract the features of a given text with TensorFlow.
from transformers import AlbertTokenizer, TFBertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = TFBertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output.last_hidden_state holds the per-token features
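Both snippets return one hidden vector per token. A common way to reduce these to a single vector per text is masked mean pooling. This is a general technique, not something the original README prescribes. The sketch below shows the arithmetic on plain Python lists so that it runs without PyTorch or TensorFlow; mean_pool and the toy numbers are illustrative assumptions.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Toy input: 3 token positions with hidden size 2; the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

print(mean_pool(hidden_states, attention_mask))
```

With the PyTorch example, the same arithmetic would apply to output.last_hidden_state[0] and encoded_input['attention_mask'][0], typically via tensor operations rather than Python loops.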

📚 Documentation

Model Description

This is a set of 5 Chinese word-based RoBERTa models pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in Zhao et al. (2019). The models can also be pre-trained by TencentPretrain, introduced in Zhao et al. (2023), which inherits UER-py to support models with more than one billion parameters and extends it to a multimodal pre-training framework.

You can download the 5 Chinese RoBERTa miniatures either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), or via HuggingFace from the links below:

Model	Link
word-based RoBERTa-Tiny	L=2/H=128 (Tiny)
word-based RoBERTa-Mini	L=4/H=256 (Mini)
word-based RoBERTa-Small	L=4/H=512 (Small)
word-based RoBERTa-Medium	L=8/H=512 (Medium)
word-based RoBERTa-Base	L=12/H=768 (Base)
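For a rough sense of scale, the sizes in the table (L is the number of layers, H the hidden size) can be turned into approximate parameter counts. This is a back-of-the-envelope sketch under three stated assumptions: a feed-forward width of 4·H (the usual BERT convention, which this README does not confirm), the 100,000-piece vocabulary from the sentencepiece settings in the Training Data section, and biases, LayerNorm, and position embeddings ignored.

```python
# (layers L, hidden size H) for each model, taken from the table above.
SIZES = {
    "Tiny": (2, 128),
    "Mini": (4, 256),
    "Small": (4, 512),
    "Medium": (8, 512),
    "Base": (12, 768),
}
VOCAB_SIZE = 100_000  # vocab_size used when training the sentencepiece model


def estimate_params(layers, hidden, vocab_size=VOCAB_SIZE):
    """Approximate weight counts (assumption: feed-forward width is 4*H).

    Per layer: 4*H^2 for attention plus 8*H^2 for the feed-forward block.
    Embeddings: one H-dimensional vector per vocabulary entry.
    """
    encoder = layers * 12 * hidden * hidden
    embeddings = vocab_size * hidden
    return encoder, embeddings


for name, (layers, hidden) in SIZES.items():
    encoder, embeddings = estimate_params(layers, hidden)
    print(f"{name:<6} encoder ~{encoder / 1e6:.1f}M  word embeddings ~{embeddings / 1e6:.1f}M")
```

Under these assumptions the word embedding table is larger than the encoder for every size except Base, which is worth keeping in mind when comparing memory footprints with character-based models.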

Performance Comparison

Compared with [char-based models](https://huggingface.co/uer/chinese_roberta_L-2_H-128), word-based models achieve better results in most cases. Here are scores on the development set of six Chinese tasks:

Model	Score	book_review	chnsenticorp	lcqmc	tnews(CLUE)	iflytek(CLUE)	ocnli(CLUE)
RoBERTa-Tiny(char)	72.3	83.4	91.4	81.8	62.0	55.0	60.3
RoBERTa-Tiny(word)	74.4(+2.1)	86.7	93.2	82.0	66.4	58.2	59.6
RoBERTa-Mini(char)	75.9	85.7	93.7	86.1	63.9	58.3	67.4
RoBERTa-Mini(word)	76.9(+1.0)	88.5	94.1	85.4	66.9	59.2	67.3
RoBERTa-Small(char)	76.9	87.5	93.4	86.5	65.1	59.4	69.7
RoBERTa-Small(word)	78.4(+1.5)	89.7	94.7	87.4	67.6	60.9	69.8
RoBERTa-Medium(char)	78.0	88.7	94.8	88.1	65.6	59.5	71.2
RoBERTa-Medium(word)	79.1(+1.1)	90.0	95.1	88.0	67.8	60.6	73.0
RoBERTa-Base(char)	79.7	90.1	95.2	89.2	67.0	60.9	75.5
RoBERTa-Base(word)	80.4(+0.7)	91.1	95.7	89.4	68.0	61.5	76.8
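The Score column is consistent with the unweighted mean of the six task scores, rounded to one decimal place. The sketch below checks that reading for every row and recomputes the word-over-char gain. Treating Score as a plain mean is an inference from the numbers, not something the README states.

```python
# Each entry: (reported Score, [book_review, chnsenticorp, lcqmc, tnews, iflytek, ocnli]),
# copied from the table above.
ROWS = {
    "Tiny": {"char": (72.3, [83.4, 91.4, 81.8, 62.0, 55.0, 60.3]),
             "word": (74.4, [86.7, 93.2, 82.0, 66.4, 58.2, 59.6])},
    "Mini": {"char": (75.9, [85.7, 93.7, 86.1, 63.9, 58.3, 67.4]),
             "word": (76.9, [88.5, 94.1, 85.4, 66.9, 59.2, 67.3])},
    "Small": {"char": (76.9, [87.5, 93.4, 86.5, 65.1, 59.4, 69.7]),
              "word": (78.4, [89.7, 94.7, 87.4, 67.6, 60.9, 69.8])},
    "Medium": {"char": (78.0, [88.7, 94.8, 88.1, 65.6, 59.5, 71.2]),
               "word": (79.1, [90.0, 95.1, 88.0, 67.8, 60.6, 73.0])},
    "Base": {"char": (79.7, [90.1, 95.2, 89.2, 67.0, 60.9, 75.5]),
             "word": (80.4, [91.1, 95.7, 89.4, 68.0, 61.5, 76.8])},
}


def mean(values):
    return sum(values) / len(values)


for size, variants in ROWS.items():
    for kind, (reported, tasks) in variants.items():
        # Tolerance allows for rounding to one decimal place in the table.
        assert abs(mean(tasks) - reported) < 0.06, (size, kind)
    gain = variants["word"][0] - variants["char"][0]
    print(f"{size:<6} word - char = {gain:+.1f}")
```

The per-task columns show that the gain is not uniform: on lcqmc and ocnli the word-based model is slightly behind for some sizes, even though the average improves at every size.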

Training Data

CLUECorpusSmall is used as training data. Google's sentencepiece is used for word segmentation. The sentencepiece model is trained on the CLUECorpusSmall corpus:

>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.train(input='cluecorpussmall.txt',
             model_prefix='cluecorpussmall_spm',
             vocab_size=100000,
             max_sentence_length=1024,
             max_sentencepiece_length=6,
             user_defined_symbols=['[MASK]','[unused1]','[unused2]',
                '[unused3]','[unused4]','[unused5]','[unused6]',
                '[unused7]','[unused8]','[unused9]','[unused10]'],
             pad_id=0,
             pad_piece='[PAD]',
             unk_id=1,
             unk_piece='[UNK]',
             bos_id=2,
             bos_piece='[CLS]',
             eos_id=3,
             eos_piece='[SEP]',
             train_extremely_large_corpus=True
            )

Training Procedure

Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on Tencent Cloud. We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters for all model sizes.

Taking the word-based RoBERTa-Medium model as an example:

Stage 1:

python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --data_processor mlm

python3 pretrain.py --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --data_processor mlm --target mlm

Stage 2:

python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --data_processor mlm

python3 pretrain.py --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --pretrained_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin-1000000 \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --data_processor mlm --target mlm
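
One way to compare the two stages is by the number of tokens processed per optimizer step. The sketch below assumes that --batch_size is the per-GPU batch size and that every sequence is filled to --seq_length. Neither is stated in this README, so treat the totals as upper-bound estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    seq_length: int
    batch_size: int   # assumption: per-GPU batch size
    world_size: int   # number of GPUs
    total_steps: int

    @property
    def tokens_per_step(self) -> int:
        return self.seq_length * self.batch_size * self.world_size

    @property
    def total_tokens(self) -> int:
        return self.tokens_per_step * self.total_steps


# Values copied from the two pretrain.py commands above.
stage1 = Stage(seq_length=128, batch_size=64, world_size=8, total_steps=1_000_000)
stage2 = Stage(seq_length=512, batch_size=16, world_size=8, total_steps=250_000)

for name, stage in (("Stage 1", stage1), ("Stage 2", stage2)):
    print(f"{name}: {stage.tokens_per_step:,} tokens/step, "
          f"{stage.total_tokens:,} tokens in total")
```

Under these assumptions both stages process 65,536 tokens per step: the smaller batch size in stage 2 exactly offsets the four-times-longer sequences.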

Finally, we convert the pre-trained model into Huggingface's format:

python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 8 --type mlm

🔧 Technical Details

Hyper-parameters for Fine-tuning

For each task, we selected the best fine-tuning hyper-parameters from the lists below, and trained with a sequence length of 128:

epochs: 3, 5, 8
batch sizes: 32, 64
learning rates: 3e-5, 1e-4, 3e-4
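
The lists above define a small grid. Assuming the search was exhaustive (the README says only that the best values were selected from these lists), the candidate configurations can be enumerated as follows. The dictionary keys are illustrative names, not flags of any particular fine-tuning script.

```python
from itertools import product

# Candidate values copied from the lists above.
EPOCHS = [3, 5, 8]
BATCH_SIZES = [32, 64]
LEARNING_RATES = [3e-5, 1e-4, 3e-4]

grid = [
    {"epochs": epochs, "batch_size": batch_size, "learning_rate": lr}
    for epochs, batch_size, lr in product(EPOCHS, BATCH_SIZES, LEARNING_RATES)
]

print(len(grid))  # 3 * 2 * 3 = 18 configurations per task
print(grid[0])
```

Each configuration would be fine-tuned and scored on the development set, with the best score per task reported.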

📄 License

No license information is provided in the original README.

BibTeX entry and citation info

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}
}
