---
language:
- fa
library_name: transformers
widget:
- text: "ز سوزناکی گفتار من [MASK] بگریست"
example_title: "Poetry 1"
- text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری"
example_title: "Poetry 2"
- text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را"
example_title: "Poetry 3"
- text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند"
example_title: "Poetry 4"
- text: "این [MASK] اولشه."
example_title: "Informal 1"
- text: "دیگه خسته شدم! [MASK] اینم شد کار؟!"
example_title: "Informal 2"
- text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم."
example_title: "Informal 3"
- text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم."
example_title: "Informal 4"
- text: "زندگی بدون [MASK] خستهکننده است."
example_title: "Formal 1"
- text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
example_title: "Formal 2"
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details
FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which encompasses both casual and formal Persian text, making it a robust choice for Persian natural language processing. Across various Natural Language Understanding (NLU) tasks, FaBERT consistently demonstrates notable improvements while keeping a compact model size. The model is available on Hugging Face, so it can be integrated into existing `transformers` workflows without added complexity.
## Features
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Remarkable performance across various downstream NLP tasks
- BERT architecture with 124 million parameters
## Useful Links

- [Paper (arXiv:2402.06617)](https://arxiv.org/abs/2402.06617)
## Usage

### Loading the Model with MLM head
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the FaBERT tokenizer and the model with its masked language modeling head
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
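
For a quick sanity check of the MLM head, the loaded model can be wrapped in a `fill-mask` pipeline and queried with one of the widget examples above (a short usage sketch, not part of the original card):

```python
from transformers import pipeline

# Reuses the `model` and `tokenizer` objects loaded above
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Print the top-3 predictions for the masked token
for prediction in fill_mask("زندگی بدون [MASK] خستهکننده است.")[:3]:
    print(prediction["token_str"], prediction["score"])
```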
### Downstream Tasks
Similar to the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks.
Examples on Persian datasets are available in our GitHub repository.
Make sure to use the default fast tokenizer.
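
As a rough, self-contained illustration (not taken from the repository), the sketch below fine-tunes FaBERT for binary sequence classification with the standard `Trainer` API; the toy texts, labels, and output directory are placeholders for a real Persian dataset and configuration:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # fast tokenizer by default
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Placeholder data; substitute a real Persian dataset (e.g. loaded with the `datasets` library)
texts = ["این فیلم عالی بود.", "اصلا از این محصول راضی نبودم."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels in the format expected by Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fabert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```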
## Training Details
FaBERT was pre-trained with the masked language modeling (MLM) objective using whole word masking (WWM), and the resulting perplexity on the validation set was 7.76.
| Hyperparameter | Value |
|----------------|-------|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |

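The whole word masking objective can be illustrated with the generic collator shipped in `transformers`; this is only a sketch of how WWM masks all sub-word pieces of randomly selected words, not the exact pre-training pipeline used for FaBERT:

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Placeholder sample sentence; the collator masks whole words at random
example = tokenizer("این یک جمله نمونه برای نمایش ماسک کردن کل کلمه است.")
batch = collator([example])

print(tokenizer.decode(batch["input_ids"][0]))  # some words replaced by [MASK]
```
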
## Evaluation
Here are some key performance results for the FaBERT model:
### Sentiment Analysis

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| MirasOpinion | 87.51 | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | 75.51 |
| DeepSentiPers | 79.85 | 74.94 | 79.00 |

### Named Entity Recognition

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| PEYMA | 91.39 | 91.24 | 90.91 |
| ParsTwiner | 82.22 | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | 58.09 | 51.47 |

### Question Answering

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| ParsiNLU | 55.87 | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | 87.60 |
| PCoQA | 53.51 | 50.96 | 51.12 |

### Natural Language Inference & QQP

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| FarsTail | 84.45 | 82.52 | 83.50 |
| SBU-NLI | 66.65 | 58.41 | 58.85 |
| ParsiNLU QQP | 82.62 | 77.60 | 79.74 |

### Number of Parameters

|  | FaBERT | ParsBERT | XLM-R |
|--|--------|----------|-------|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |

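As a quick sanity check (a small sketch, not from the original card), the FaBERT column above can be reproduced approximately from the loaded encoder:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("sbunlp/fabert")

# Should print roughly 124M parameters and a 50K-entry vocabulary
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
print(f"{model.config.vocab_size / 1e3:.0f}K vocabulary entries")
```
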
For a more detailed performance analysis, refer to the [paper](https://arxiv.org/abs/2402.06617).
## How to Cite
If you use FaBERT in your research or projects, please cite it using the following BibTeX:
```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```