roberta-base-finetuned-dianping-chinese Open-Source Model - Multi-domain Chinese Sentiment Analysis and News Classification

Roberta Base Finetuned Dianping Chinese

Developed by uer

Includes 5 Chinese text classification models based on RoBERTa-Base, suitable for sentiment analysis and news classification tasks across different domains

Tags: Text Classification, Chinese, Chinese Sentiment Analysis, News Topic Classification, E-commerce Review Classification

Downloads 10.99k

Release Time : 3/2/2022

Model Overview

This series of models is fine-tuned with the UER-py framework for Chinese text classification tasks, including sentiment polarity analysis and news topic classification

Model Features

Multi-domain Coverage

Includes 5 classification models for different domains, covering various scenarios such as e-commerce reviews and news classification

Efficient Fine-tuning

Fine-tuned from a pre-trained Chinese RoBERTa-Base model for three epochs, using the same hyper-parameters for every classification task

Easy to Use

Provides a HuggingFace interface, allowing direct text classification via a pipeline

Model Capabilities

Chinese Text Classification

Sentiment Polarity Analysis

News Topic Classification

User Review Analysis

Use Cases

E-commerce Analysis

JD.com Review Sentiment Analysis

Analyze the sentiment polarity (positive/negative) of JD.com product reviews

Provides both binary classification and full multi-class model options

News Classification

News Topic Classification

Classify news lead paragraphs by topic (e.g., politics, economy)

Supports both the Phoenix News (Ifeng) and China News (Chinanews) classification systems

🚀 Chinese RoBERTa-Base Models for Text Classification

These Chinese RoBERTa-Base models are designed for text classification. They are fine-tuned on five Chinese text datasets covering user reviews and news articles, so each model targets a specific domain and label set.

🚀 Quick Start

You can use this model directly with a pipeline for text classification. Here is an example using the roberta-base-finetuned-chinanews-chinese model:

>>> from transformers import AutoModelForSequenceClassification,AutoTokenizer,pipeline
>>> model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
>>> tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
>>> text_classification = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
>>> text_classification("北京上个月召开了两会")
    [{'label': 'mainland China politics', 'score': 0.7211663722991943}]
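
The same pipeline can be applied to several texts at once. The following is a minimal sketch, not part of the original model card: `build_classifier` and `classify_batch` are hypothetical helper names, the `min_score` threshold is an illustrative choice, and the model ID for the Dianping model is assumed to follow the same `uer/` pattern as the Chinanews example above. The task name `text-classification` is used here; `sentiment-analysis` is an alias for it in transformers.

```python
from typing import Callable, Dict, List, Sequence

# Shape of one pipeline result, e.g. {'label': '...', 'score': 0.72}
Prediction = Dict[str, object]


def build_classifier(model_id: str = "uer/roberta-base-finetuned-dianping-chinese"):
    """Create a text-classification pipeline (downloads the weights on first use).

    The default model ID is an assumption based on the naming used in this card.
    """
    # Imported lazily so that classify_batch has no hard dependency on transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_batch(
    classifier: Callable[[List[str]], List[Prediction]],
    texts: Sequence[str],
    min_score: float = 0.0,
) -> List[Dict[str, object]]:
    """Classify several texts and mark predictions scoring below `min_score`."""
    results = classifier(list(texts))  # one {'label', 'score'} dict per input text
    rows = []
    for text, pred in zip(texts, results):
        score = float(pred["score"])
        rows.append(
            {
                "text": text,
                "label": pred["label"],
                "score": score,
                "confident": score >= min_score,
            }
        )
    return rows
```

With a real pipeline this would be called as `classify_batch(build_classifier(), ["这家餐厅的菜很好吃", "服务态度太差了"], min_score=0.6)`. Because `classify_batch` only needs a callable that returns pipeline-shaped results, it can also be exercised with a stub when no network access is available.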

✨ Features

Fine-tuned on Multiple Datasets: These models are fine-tuned on 5 different Chinese text classification datasets, including user reviews and news articles.
Flexible Fine-tuning: Can be further fine-tuned using [UER-py](https://github.com/dbiir/UER-py/) or TencentPretrain.
HuggingFace Compatibility: Can be easily integrated into HuggingFace's ecosystem.

📦 Installation

No model-specific installation is required. Install the transformers library with pip install transformers; the Quick Start example also needs a deep learning backend such as PyTorch (pip install torch).

📚 Documentation

Model description

This is a set of 5 Chinese RoBERTa-Base classification models fine-tuned by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in the UER paper (zhao2019uer in the citation list below). The models can also be fine-tuned by TencentPretrain, introduced in the TencentPretrain paper (zhao2023tencentpretrain). TencentPretrain inherits UER-py to support models with more than one billion parameters and extends it to a multimodal pre-training framework.

You can download the 5 Chinese RoBERTa-Base classification models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), or via HuggingFace using the model names below:

| Property | Details |
| :------- | :------ |
| Model Type | Chinese RoBERTa-Base classification models |
| Training Data | JD full, JD binary, Dianping, Ifeng, Chinanews datasets |

Download Links

| Dataset | Link |
| :-----------: | :-------------------------------------------------------: |
| JD full | roberta-base-finetuned-jd-full-chinese |
| JD binary | roberta-base-finetuned-jd-binary-chinese |
| Dianping | roberta-base-finetuned-dianping-chinese |
| Ifeng | roberta-base-finetuned-ifeng-chinese |
| Chinanews | roberta-base-finetuned-chinanews-chinese |
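
To pick a model programmatically, the table above can be encoded as a small lookup. This is only a sketch: `hub_id` is a hypothetical helper, and the `uer/` namespace is taken from the Quick Start example and assumed to apply to all five models.

```python
# Hub namespace as it appears in the Quick Start example; assumed for all five models.
HUB_NAMESPACE = "uer"

# Dataset name (as listed in the table) -> model name.
DATASET_TO_MODEL = {
    "JD full": "roberta-base-finetuned-jd-full-chinese",
    "JD binary": "roberta-base-finetuned-jd-binary-chinese",
    "Dianping": "roberta-base-finetuned-dianping-chinese",
    "Ifeng": "roberta-base-finetuned-ifeng-chinese",
    "Chinanews": "roberta-base-finetuned-chinanews-chinese",
}


def hub_id(dataset: str) -> str:
    """Return the namespaced model ID for a dataset name (case-insensitive)."""
    key = dataset.strip().lower()
    for name, model in DATASET_TO_MODEL.items():
        if name.lower() == key:
            return f"{HUB_NAMESPACE}/{model}"
    raise KeyError(f"unknown dataset: {dataset!r}")
```

The returned string can be passed as the `model` argument of a transformers pipeline or to `from_pretrained`.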

Training data

Five Chinese text classification datasets are used. The JD full, JD binary, and Dianping datasets consist of user reviews with different sentiment polarities. Ifeng and Chinanews consist of the first paragraphs of news articles from different topic classes. The datasets were collected by the Glyph project, and more details are discussed in the corresponding paper (zhang2017encoding in the citation list below).

Training procedure

Models are fine-tuned by [UER-py](https://github.com/dbiir/UER-py/) on Tencent Cloud. We fine-tune for three epochs with a sequence length of 512, starting from the pre-trained model [chinese_roberta_L-12_H-768](https://huggingface.co/uer/chinese_roberta_L-12_H-768). At the end of each epoch, the model is saved if it achieves the best performance so far on the development set. We use the same hyper-parameters for all models.
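
The save-on-best-development-score rule described above can be sketched as follows. This is an illustration of the rule, not UER-py's actual code: `run_epochs` and its callback parameters are hypothetical names, and the strict `>` comparison (ties keep the earlier checkpoint) is an assumption.

```python
from typing import Callable, Optional, Tuple


def run_epochs(
    epochs_num: int,
    train_one_epoch: Callable[[int], None],
    evaluate_on_dev: Callable[[int], float],
    save_model: Callable[[int], None],
) -> Tuple[Optional[int], float]:
    """Train for `epochs_num` epochs, saving only when the dev score improves."""
    best_score = float("-inf")
    best_epoch: Optional[int] = None
    for epoch in range(1, epochs_num + 1):
        train_one_epoch(epoch)
        score = evaluate_on_dev(epoch)
        # Assumption: a tie does not overwrite the previously saved model.
        if score > best_score:
            best_score, best_epoch = score, epoch
            save_model(epoch)
    return best_epoch, best_score
```

With three epochs and development scores of 0.80, 0.86, and 0.84, the model is saved after epochs 1 and 2, and the epoch-2 checkpoint is the one that remains.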

Taking the case of roberta-base-finetuned-chinanews-chinese:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --train_path datasets/glyph/chinanews/train.tsv \
                                   --dev_path datasets/glyph/chinanews/dev.tsv \
                                   --output_model_path models/chinanews_classifier_model.bin \
                                   --learning_rate 3e-5 --epochs_num 3 --batch_size 32 --seq_length 512
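
The `--train_path` and `--dev_path` arguments point to TSV files. The sketch below shows one way to produce such a file for your own data. It assumes the layout used by UER-py's classification datasets (a header row with `label` and `text_a` columns and integer class labels); check this against the UER-py documentation before relying on it. `format_classification_tsv` is a hypothetical helper, not part of UER-py.

```python
from typing import Iterable, Tuple


def format_classification_tsv(rows: Iterable[Tuple[int, str]]) -> str:
    """Render (label, text) pairs as TSV text with a `label`/`text_a` header.

    Tabs and newlines inside a text are collapsed to single spaces so that
    every example stays on one line with exactly two columns.
    """
    lines = ["label\ttext_a"]
    for label, text in rows:
        clean_text = " ".join(text.split())  # collapse all runs of whitespace
        lines.append(f"{int(label)}\t{clean_text}")
    return "\n".join(lines) + "\n"
```

The result can be written to disk with, for example, `pathlib.Path("train.tsv").write_text(format_classification_tsv(rows), encoding="utf-8")`.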

Finally, we convert the fine-tuned model into HuggingFace's format:

python3 scripts/convert_bert_text_classification_from_uer_to_huggingface.py --input_model_path models/chinanews_classifier_model.bin \
                                                                            --output_model_path pytorch_model.bin \
                                                                            --layers_num 12

BibTeX entry and citation info

@article{liu2019roberta,
  title={Roberta: A robustly optimized bert pretraining approach},
  author={Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1907.11692},
  year={2019}
}

@article{zhang2017encoding,
  title={Which encoding is the best for text classification in chinese, english, japanese and korean?},
  author={Zhang, Xiang and LeCun, Yann},
  journal={arXiv preprint arXiv:1708.02657},
  year={2017}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}
}
