EstBERT
Developed by tartuNLP
EstBERT is a BERT model pre-trained specifically for Estonian, available in two sequence-length versions (128 and 512), and performing strongly across a range of natural language processing tasks.
Downloads 398
Release Time: 3/2/2022
Model Overview
EstBERT is a BERT model pre-trained on an Estonian corpus, mainly used for Estonian text understanding and processing tasks, such as part-of-speech tagging, named entity recognition, topic classification, and sentiment analysis.
Model Features
Dedicated to Estonian
Trained exclusively on Estonian data, it outperforms multilingual models on Estonian tasks.
Two sequence-length versions
Available in two versions with maximum sequence lengths of 128 and 512, covering both short and long texts.
Superior comprehensive performance
Outperforms mBERT and XLM-RoBERTa across tasks such as part-of-speech tagging, named entity recognition, topic classification, and sentiment analysis.
Model Capabilities
Part-of-speech tagging
Named entity recognition
Topic classification
Sentiment analysis
Text understanding
Masked language modeling
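The masked language modeling capability can be tried directly with the Hugging Face `transformers` library. The sketch below assumes the checkpoint is published under the developer's namespace as `tartuNLP/EstBERT`, and the Estonian sentence is purely illustrative:

```python
from transformers import pipeline

# Load EstBERT for masked-token prediction; the checkpoint name
# "tartuNLP/EstBERT" is an assumption based on the developer name.
fill_mask = pipeline("fill-mask", model="tartuNLP/EstBERT")

# Illustrative Estonian sentence: "Tallinn is the [MASK] of Estonia."
text = "Tallinn on Eesti [MASK]."

# Print the top predictions for the masked position.
for prediction in fill_mask(text, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```

The same `pipeline` interface works for either sequence-length version of the model; longer inputs simply require the 512-length variant.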
Use Cases
Natural language processing
Part-of-speech tagging
Perform part-of-speech tagging on Estonian texts
UPOS tagging accuracy reaches 97.89%, surpassing the comparison models.
Sentiment analysis
Analyze the sentiment tendency of Estonian texts
The F1 score reaches 74.50, outperforming mBERT.
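For downstream tasks such as the sentiment analysis use case above, EstBERT serves as the backbone of a fine-tuned classifier. A minimal sketch, again assuming the checkpoint name `tartuNLP/EstBERT` and a hypothetical three-class sentiment task:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load EstBERT with an untrained classification head on top; the
# checkpoint name "tartuNLP/EstBERT" is an assumption, and num_labels
# should match your task (here, 3 hypothetical sentiment classes).
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/EstBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "tartuNLP/EstBERT", num_labels=3
)

# Illustrative Estonian input: "This film was excellent!"
inputs = tokenizer("See film oli suurepärane!", return_tensors="pt")
outputs = model(**inputs)

# One logit per class; meaningful scores require fine-tuning first.
print(outputs.logits.shape)
```

The head is randomly initialized, so the model must be fine-tuned on labeled Estonian sentiment data (e.g. with the `Trainer` API) before its predictions are useful.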
© 2025 AIbase