HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
HeBERT is a Hebrew pre-trained language model based on Google's BERT architecture with the BERT-Base configuration (Devlin et al., 2018). It is intended to facilitate polarity (sentiment) analysis and emotion recognition in Hebrew text.
Quick Start
HeBERT can be used for several downstream tasks, including masked language modeling (masked-LM), sentiment (polarity) classification, and named entity recognition (NER). The following sections show how to use it for each task.
Features
Training Datasets
HeBERT was trained on three datasets:
- A Hebrew version of OSCAR (Ortiz, 2019): Approximately 9.8 GB of data, including 1 billion words and over 20.8 million sentences.
- A Hebrew dump of Wikipedia: Around 650 MB of data, including over 63 million words and 3.8 million sentences.
- Emotion UGC data that was collected for the purpose of this study (described below).
Emotion UGC Data Description
Our User Generated Content (UGC) consists of comments written on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is about 150 MB, including over 7 million words and 350K sentences.
4,000 sentences were annotated by crowd workers (3 to 10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happiness, sadness, surprise, and trust) and for overall sentiment/polarity. To validate the annotation, we used Krippendorff's alpha (Krippendorff, 1970) to measure agreement between raters on the emotion expressed in each sentence, and we retained sentences with alpha > 0.7. Note that while raters generally agreed about emotions such as happiness, trust, and disgust, there was some disagreement about a few emotions, presumably because they are harder to identify in text (e.g., expectation and surprise).
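For illustration, agreement scores of this kind can be computed with the third-party krippendorff Python package (not part of HeBERT); the ratings below are made up, and the 0.7 threshold mirrors the filtering rule described above.

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Hypothetical binary ratings for one emotion: rows are annotators, columns are sentences.
# np.nan marks a sentence that a given annotator did not rate.
ratings = np.array([
    [1, 0, 1, np.nan],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(alpha)  # keep only data with alpha > 0.7, as in the annotation procedure above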
Installation
The original README does not provide a dedicated installation command. HeBERT only requires the transformers library, which you can install with pip install transformers.
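As a quick sanity check (assuming transformers and a PyTorch backend such as torch are already installed), you can verify that the library imports correctly:

import transformers
print(transformers.__version__)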
Usage Examples
Basic Usage - Masked-LM Model
from transformers import AutoTokenizer, AutoModel, pipeline

# Load the base tokenizer and encoder (only needed if you want to work with them directly)
tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
model = AutoModel.from_pretrained("avichr/heBERT")

# Fill-mask pipeline for masked language modeling
fill_mask = pipeline(
    "fill-mask",
    model="avichr/heBERT",
    tokenizer="avichr/heBERT"
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר .")  # "The coronavirus took the [MASK] and we have nothing left."
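The fill-mask pipeline returns a list of candidate completions, each a dict with the predicted token and its score; a minimal sketch of inspecting them:

# Print each candidate token for the masked position together with its score
predictions = fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר .")
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 3))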
Advanced Usage - Sentiment Classification Model (Polarity ONLY)
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Loading the tokenizer and model directly is optional; the pipeline below loads them by name.
tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")
model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT_sentiment_analysis")

sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="avichr/heBERT_sentiment_analysis",
    tokenizer="avichr/heBERT_sentiment_analysis",
    return_all_scores=True
)

print(sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים'))  # "I'm debating what to eat for lunch"
print(sentiment_analysis('קפה זה טעים'))  # "Coffee is tasty"
print(sentiment_analysis('אני לא אוהב את העולם'))  # "I don't like the world"
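Since return_all_scores=True returns a score for every class, you may only want the highest-scoring label; a minimal sketch (the exact label strings come from the model's configuration):

# Pick the highest-scoring class for a single sentence ("Coffee is tasty")
scores = sentiment_analysis('קפה זה טעים')[0]
top = max(scores, key=lambda item: item["score"])
print(top["label"], round(top["score"], 3))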
Our model is also available on AWS. For more information, visit AWS' git repository.
Advanced Usage - NER Model
from transformers import pipeline

# Token-classification (NER) pipeline
NER = pipeline(
    "token-classification",
    model="avichr/heBERT_NER",
    tokenizer="avichr/heBERT_NER",
)
NER('דויד לומד באוניברסיטה העברית שבירושלים')  # "David studies at the Hebrew University in Jerusalem"
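By default, the token-classification pipeline returns one prediction per word piece. If you prefer whole entity spans, the standard transformers aggregation_strategy argument can be used; a sketch, assuming a recent transformers version:

from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into grouped entity spans
NER_grouped = pipeline(
    "token-classification",
    model="avichr/heBERT_NER",
    tokenizer="avichr/heBERT_NER",
    aggregation_strategy="simple",
)
print(NER_grouped('דויד לומד באוניברסיטה העברית שבירושלים'))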
Documentation
Citing the Model
If you use this model, please cite us as follows:
Chriqui, A., & Yahav, I. (2022). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. INFORMS Journal on Data Science, forthcoming.
@article{chriqui2021hebert,
title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
author={Chriqui, Avihay and Yahav, Inbal},
journal={INFORMS Journal on Data Science},
year={2022}
}
Future Plans
We are still working on our models and will update this page as we make progress. Note that we have currently released only the sentiment analysis (polarity) model; the emotion detection model will be released later.
Our GitHub repository: https://github.com/avichaychriqui/HeBERT