HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
HeBERT is a Hebrew pre-trained language model based on Google's BERT architecture in the BERT-Base configuration (Devlin et al., 2018). It aims to facilitate polarity analysis and emotion recognition in Hebrew text.
Training Data
HeBERT was trained on three datasets:
- A Hebrew version of OSCAR (Ortiz, 2019): Approximately 9.8 GB of data, containing 1 billion words and over 20.8 million sentences.
- A Hebrew dump of Wikipedia: Around 650 MB of data, including over 63 million words and 3.8 million sentences.
- Emotion UGC data collected specifically for this study (described below).

We evaluated the model on two downstream tasks: emotion recognition and sentiment analysis.
Emotion UGC Data Description
Our User-Generated Content (UGC) consists of comments written on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is approximately 150 MB, including over 7 million words and 350K sentences.
4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happiness, sadness, surprise, and trust) and for overall sentiment/polarity.
To validate the annotation, we measured agreement between raters on the emotion in each sentence using Krippendorff's alpha (Krippendorff, 1970) and retained sentences with alpha > 0.7. While raters generally agreed on emotions such as happiness, trust, and disgust, a few emotions drew general disagreement, presumably because they are harder to identify in text (e.g., expectation and surprise).
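For illustration, here is a minimal sketch of this agreement filter, assuming the open-source `krippendorff` Python package; the ratings matrix and threshold check below are an illustrative reading of the procedure, not the study's actual data or code.

```python
import numpy as np
import krippendorff

# Hypothetical ratings for one emotion: rows = annotators, columns = sentences;
# 1 = emotion present, 0 = absent, np.nan = annotator skipped that sentence.
ratings = np.array([
    [1, 0, 1, np.nan],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
])

# Nominal-level Krippendorff's alpha over the rater-by-sentence matrix.
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}; retain (alpha > 0.7): {alpha > 0.7}")
```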
Performance
Sentiment Analysis
|              | Precision | Recall | F1-Score |
|--------------|-----------|--------|----------|
| Neutral      | 0.83      | 0.56   | 0.67     |
| Positive     | 0.96      | 0.92   | 0.94     |
| Negative     | 0.97      | 0.99   | 0.98     |
| Accuracy     |           |        | 0.97     |
| Macro Avg    | 0.92      | 0.82   | 0.86     |
| Weighted Avg | 0.96      | 0.97   | 0.96     |

(In the model's output, the neutral class is labeled 'natural', as shown in the usage examples below.)
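A report in exactly this layout (per-class precision/recall/F1 plus accuracy, macro, and weighted averages) can be produced with scikit-learn's `classification_report`; the gold labels and predictions below are made up for illustration, not the study's evaluation data.

```python
from sklearn.metrics import classification_report

# Hypothetical gold labels and model predictions.
y_true = ["positive", "negative", "natural", "negative", "positive"]
y_pred = ["positive", "negative", "natural", "positive", "positive"]

# Prints per-class precision/recall/F1, accuracy, macro avg, and weighted avg.
print(classification_report(y_true, y_pred))
```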
Usage Examples
For Masked-LM Model (Can be Fine-Tuned to Any Downstream Task)
```python
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
model = AutoModel.from_pretrained("avichr/heBERT")

# Fill in the [MASK] token with the pre-trained masked-LM.
fill_mask = pipeline(
    "fill-mask",
    model="avichr/heBERT",
    tokenizer="avichr/heBERT"
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")  # "The coronavirus took the [MASK] and we were left with nothing."
```
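Since this checkpoint is meant to be fine-tuned, the following is a minimal, hypothetical sketch of a single fine-tuning step with a sequence-classification head on top of HeBERT; the example texts, label ids, class count, and learning rate are illustrative assumptions, not HeBERT's released training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
# Adds a randomly initialized classification head; 3 classes is an assumption here.
model = AutoModelForSequenceClassification.from_pretrained("avichr/heBERT", num_labels=3)

texts = ["קפה זה טעים", "אני לא אוהב את העולם"]  # toy sentences reused from this page
labels = torch.tensor([1, 2])                    # hypothetical label ids

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
```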
For Sentiment Classification Model (Polarity ONLY)
```python
from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis")
model = AutoModel.from_pretrained("avichr/heBERT_sentiment_analysis")

# Return the score of every polarity class, not just the top one.
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="avichr/heBERT_sentiment_analysis",
    tokenizer="avichr/heBERT_sentiment_analysis",
    return_all_scores=True
)
```
```python
>>> sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')  # "I'm debating what to eat for lunch"
[[{'label': 'natural', 'score': 0.9978172183036804},
  {'label': 'positive', 'score': 0.0014792329166084528},
  {'label': 'negative', 'score': 0.0007035882445052266}]]

>>> sentiment_analysis('קפה זה טעים')  # "Coffee is tasty"
[[{'label': 'natural', 'score': 0.00047328314394690096},
  {'label': 'positive', 'score': 0.9994067549705505},
  {'label': 'negative', 'score': 0.00011996887042187154}]]

>>> sentiment_analysis('אני לא אוהב את העולם')  # "I do not like the world"
[[{'label': 'natural', 'score': 9.214012970915064e-05},
  {'label': 'positive', 'score': 8.876807987689972e-05},
  {'label': 'negative', 'score': 0.9998190999031067}]]
```
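Because `return_all_scores=True` yields a score for every class, the predicted polarity is simply the highest-scoring entry; a small sketch using the pipeline created above:

```python
# The pipeline returns one list of class scores per input sentence.
scores = sentiment_analysis('קפה זה טעים')[0]
best = max(scores, key=lambda d: d['score'])
print(best['label'])  # 'positive'
```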
Our model is also available on AWS! For more information, visit AWS's GitHub repository.
Stay Tuned!
We are still working on our model and will update this page as we make progress. Note that we have released only the sentiment analysis (polarity) model so far; emotion detection will be released later.
Our git: https://github.com/avichaychriqui/HeBERT
Citation
If you use this model, please cite us as:
Chriqui, A., & Yahav, I. (2021). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. arXiv preprint arXiv:2102.01909.
```bibtex
@article{chriqui2021hebert,
  title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
  author={Chriqui, Avihay and Yahav, Inbal},
  journal={arXiv preprint arXiv:2102.01909},
  year={2021}
}
```