HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
HeBERT is a Hebrew pre-trained language model based on Google's BERT architecture with the BERT-Base configuration (Devlin et al., 2018). It is intended to facilitate polarity (sentiment) analysis and emotion recognition in Hebrew text.
Quick Start
HeBERT can be used for several downstream tasks, including masked language modeling (masked-LM), sentiment (polarity) classification, and named entity recognition (NER). The following sections show how to use it for each task.
Features
Training Datasets
HeBERT was trained on three datasets:
- A Hebrew version of OSCAR (Ortiz, 2019): Approximately 9.8 GB of data, including 1 billion words and over 20.8 million sentences.
- A Hebrew dump of Wikipedia: Around 650 MB of data, including over 63 million words and 3.8 million sentences.
- Emotion UGC data that was collected for the purpose of this study (described below).
Emotion UGC Data Description
Our User Generated Content (UGC) consists of comments written on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is about 150 MB, including over 7 million words and 350K sentences.
4,000 sentences were annotated by crowd workers (3 to 10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happiness, sadness, surprise, and trust) and for overall sentiment/polarity. To validate the annotation, we used Krippendorff's alpha (Krippendorff, 1970) to measure agreement between raters on the emotion expressed in each sentence, and we retained sentences with alpha > 0.7. Note that while raters generally agreed about emotions such as happiness, trust, and disgust, there was some disagreement about a few emotions, presumably because they are harder to identify in text (e.g., expectation and surprise).
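For illustration, agreement scores of this kind can be computed with the third-party krippendorff Python package (not part of HeBERT); the ratings below are made up, and the 0.7 threshold mirrors the filtering rule described above.

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Hypothetical binary ratings for one emotion: rows are annotators, columns are sentences.
# np.nan marks a sentence that a given annotator did not rate.
ratings = np.array([
    [1, 0, 1, np.nan],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(alpha)  # keep only data with alpha > 0.7, as in the annotation procedure above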
Installation
The original README does not provide a dedicated installation command. HeBERT only requires the transformers library, which you can install with pip install transformers.
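As a quick sanity check (assuming transformers and a PyTorch backend such as torch are already installed), you can verify that the library imports correctly:

import transformers
print(transformers.__version__)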
Usage Examples
Basic Usage - Masked-LM Model
from transformers import AutoTokenizer, AutoModel, pipeline

# Load the base tokenizer and encoder (only needed if you want to work with them directly)
tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
model = AutoModel.from_pretrained("avichr/heBERT")

# Fill-mask pipeline for masked language modeling
fill_mask = pipeline(
    "fill-mask",
    model="avichr/heBERT",
    tokenizer="avichr/heBERT"
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר .")  # "The coronavirus took the [MASK] and we have nothing left."
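The fill-mask pipeline returns a list of candidate completions, each a dict with the predicted token and its score; a minimal sketch of inspecting them:

# Print each candidate token for the masked position together with its score
predictions = fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר .")
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 3))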
Advanced Usage - Sentiment Classification Model (Polarity ONLY)
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Loading the tokenizer and model directly is optional; the pipeline below loads them by name.
tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")
model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT_sentiment_analysis")

sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="avichr/heBERT_sentiment_analysis",
    tokenizer="avichr/heBERT_sentiment_analysis",
    return_all_scores=True
)

print(sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים'))  # "I'm debating what to eat for lunch"
print(sentiment_analysis('קפה זה טעים'))  # "Coffee is tasty"
print(sentiment_analysis('אני לא אוהב את העולם'))  # "I don't like the world"
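Since return_all_scores=True returns a score for every class, you may only want the highest-scoring label; a minimal sketch (the exact label strings come from the model's configuration):

# Pick the highest-scoring class for a single sentence ("Coffee is tasty")
scores = sentiment_analysis('קפה זה טעים')[0]
top = max(scores, key=lambda item: item["score"])
print(top["label"], round(top["score"], 3))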
Our model is also available on AWS. For more information, visit AWS' git repository.
Advanced Usage - NER Model
from transformers import pipeline

# Token-classification (NER) pipeline
NER = pipeline(
    "token-classification",
    model="avichr/heBERT_NER",
    tokenizer="avichr/heBERT_NER",
)
NER('דויד לומד באוניברסיטה העברית שבירושלים')  # "David studies at the Hebrew University in Jerusalem"
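By default, the token-classification pipeline returns one prediction per word piece. If you prefer whole entity spans, the standard transformers aggregation_strategy argument can be used; a sketch, assuming a recent transformers version:

from transformers import pipeline

# aggregation_strategy="simple" merges word pieces into grouped entity spans
NER_grouped = pipeline(
    "token-classification",
    model="avichr/heBERT_NER",
    tokenizer="avichr/heBERT_NER",
    aggregation_strategy="simple",
)
print(NER_grouped('דויד לומד באוניברסיטה העברית שבירושלים'))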
Documentation
Citing the Model
If you use this model, please cite us as follows:
Chriqui, A., & Yahav, I. (2022). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. INFORMS Journal on Data Science, forthcoming.
@article{chriqui2021hebert,
title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
author={Chriqui, Avihay and Yahav, Inbal},
journal={INFORMS Journal on Data Science},
year={2022}
}
Future Plans
We are still working on our models and will update this page as we make progress. Note that we have currently released only the sentiment analysis (polarity) model; the emotion detection model will be released later.
Our GitHub repository: https://github.com/avichaychriqui/HeBERT