HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
HeBERT is a Hebrew pre-trained language model based on Google's BERT architecture in the BERT-Base configuration (Devlin et al., 2018). It aims to facilitate polarity analysis and emotion recognition in Hebrew text.
Training Data
HeBERT was trained on three datasets:
- A Hebrew version of OSCAR (Ortiz, 2019): Approximately 9.8 GB of data, containing 1 billion words and over 20.8 million sentences.
- A Hebrew dump of Wikipedia: Around 650 MB of data, including over 63 million words and 3.8 million sentences.
- Emotion UGC data collected specifically for this study (described below).

We evaluated the model on two downstream tasks: emotion recognition and sentiment analysis.
Emotion UGC Data Description
Our User-Generated Content (UGC) consists of comments written on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is approximately 150 MB, including over 7 million words and 350K sentences.
4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happiness, sadness, surprise, and trust) and for overall sentiment/polarity.
To validate the annotation, we measured agreement between raters on the emotion in each sentence using Krippendorff's alpha (Krippendorff, 1970) and retained sentences with alpha > 0.7. While raters generally agreed on emotions such as happiness, trust, and disgust, a few emotions drew general disagreement, presumably because they are harder to identify in text (e.g., expectation and surprise).
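For illustration, here is a minimal sketch of this agreement filter, assuming the open-source `krippendorff` Python package; the ratings matrix and threshold check below are an illustrative reading of the procedure, not the study's actual data or code.

```python
import numpy as np
import krippendorff

# Hypothetical ratings for one emotion: rows = annotators, columns = sentences;
# 1 = emotion present, 0 = absent, np.nan = annotator skipped that sentence.
ratings = np.array([
    [1, 0, 1, np.nan],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
])

# Nominal-level Krippendorff's alpha over the rater-by-sentence matrix.
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}; retain (alpha > 0.7): {alpha > 0.7}")
```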
Performance
Sentiment Analysis
|              | Precision | Recall | F1-Score |
|--------------|-----------|--------|----------|
| Neutral      | 0.83      | 0.56   | 0.67     |
| Positive     | 0.96      | 0.92   | 0.94     |
| Negative     | 0.97      | 0.99   | 0.98     |
| Accuracy     |           |        | 0.97     |
| Macro Avg    | 0.92      | 0.82   | 0.86     |
| Weighted Avg | 0.96      | 0.97   | 0.96     |

(In the model's output, the neutral class is labeled 'natural', as shown in the usage examples below.)
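A report in exactly this layout (per-class precision/recall/F1 plus accuracy, macro, and weighted averages) can be produced with scikit-learn's `classification_report`; the gold labels and predictions below are made up for illustration, not the study's evaluation data.

```python
from sklearn.metrics import classification_report

# Hypothetical gold labels and model predictions.
y_true = ["positive", "negative", "natural", "negative", "positive"]
y_pred = ["positive", "negative", "natural", "positive", "positive"]

# Prints per-class precision/recall/F1, accuracy, macro avg, and weighted avg.
print(classification_report(y_true, y_pred))
```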
Usage Examples
For Masked-LM Model (Can be Fine-Tuned to Any Downstream Task)
```python
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
model = AutoModel.from_pretrained("avichr/heBERT")

# Fill in the [MASK] token with the pre-trained masked-LM.
fill_mask = pipeline(
    "fill-mask",
    model="avichr/heBERT",
    tokenizer="avichr/heBERT"
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")  # "The coronavirus took the [MASK] and we were left with nothing."
```
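Since this checkpoint is meant to be fine-tuned, the following is a minimal, hypothetical sketch of a single fine-tuning step with a sequence-classification head on top of HeBERT; the example texts, label ids, class count, and learning rate are illustrative assumptions, not HeBERT's released training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
# Adds a randomly initialized classification head; 3 classes is an assumption here.
model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT", num_labels=3)

texts = ["קפה זה טעים", "אני לא אוהב את העולם"]  # toy sentences reused from this page
labels = torch.tensor([1, 2])                    # hypothetical label ids

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```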
For Sentiment Classification Model (Polarity ONLY)
```python
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")
model = AutoModel.from_pretrained("avichr/heBERT_sentiment_analysis")

# Return the score of every polarity class, not just the top one.
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="avichr/heBERT_sentiment_analysis",
    tokenizer="avichr/heBERT_sentiment_analysis",
    return_all_scores=True
)
```
```python
>>> sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')  # "I'm debating what to eat for lunch"
[[{'label': 'natural', 'score': 0.9978172183036804},
  {'label': 'positive', 'score': 0.0014792329166084528},
  {'label': 'negative', 'score': 0.0007035882445052266}]]

>>> sentiment_analysis('קפה זה טעים')  # "Coffee is tasty"
[[{'label': 'natural', 'score': 0.00047328314394690096},
  {'label': 'positive', 'score': 0.9994067549705505},
  {'label': 'negative', 'score': 0.00011996887042187154}]]

>>> sentiment_analysis('אני לא אוהב את העולם')  # "I do not like the world"
[[{'label': 'natural', 'score': 9.214012970915064e-05},
  {'label': 'positive', 'score': 8.876807987689972e-05},
  {'label': 'negative', 'score': 0.9998190999031067}]]
```
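Because `return_all_scores=True` yields a score for every class, the predicted polarity is simply the highest-scoring entry; a small sketch using the pipeline created above:

```python
# The pipeline returns one list of class scores per input sentence.
scores = sentiment_analysis('קפה זה טעים')[0]
best = max(scores, key=lambda d: d['score'])
print(best['label'])  # 'positive'
```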
Our model is also available on AWS! For more information, visit AWS's GitHub repository.
Stay Tuned!
We are still working on our model and will update this page as we make progress. Note that we have released only the sentiment analysis (polarity) model so far; emotion detection will be released later.
Our git: https://github.com/avichaychriqui/HeBERT
Citation
If you use this model, please cite us as:
Chriqui, A., & Yahav, I. (2021). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. arXiv preprint arXiv:2102.01909.
```bibtex
@article{chriqui2021hebert,
  title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={arXiv preprint arXiv:2102.01909},
  year={2021}
}
```