---
language:
- fa
library_name: transformers
widget:
- text: "ز سوزناکی گفتار من [MASK] بگریست"
example_title: "Poetry 1"
- text: "نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری"
example_title: "Poetry 2"
- text: "هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را"
example_title: "Poetry 3"
- text: "غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند"
example_title: "Poetry 4"
- text: "این [MASK] اولشه."
example_title: "Informal 1"
- text: "دیگه خسته شدم! [MASK] اینم شد کار؟!"
example_title: "Informal 2"
- text: "فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم."
example_title: "Informal 3"
- text: "تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم."
example_title: "Informal 4"
- text: "زندگی بدون [MASK] خستهکننده است."
example_title: "Formal 1"
- text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
example_title: "Formal 2"
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details
FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which encompasses both casual and formal Persian text, making it a robust choice for Persian natural language processing. Across various Natural Language Understanding (NLU) tasks, FaBERT consistently demonstrates notable improvements while keeping a compact model size. The model is available on Hugging Face, so it can be integrated into existing `transformers` workflows without added complexity.
## Features
- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Remarkable performance across various downstream NLP tasks
- BERT architecture with 124 million parameters
## Useful Links

- [Paper (arXiv:2402.06617)](https://arxiv.org/abs/2402.06617)
## Usage

### Loading the Model with MLM head
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the FaBERT tokenizer and the model with its masked language modeling head
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
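
For a quick sanity check of the MLM head, the loaded model can be wrapped in a `fill-mask` pipeline and queried with one of the widget examples above (a short usage sketch, not part of the original card):

```python
from transformers import pipeline

# Reuses the `model` and `tokenizer` objects loaded above
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Print the top-3 predictions for the masked token
for prediction in fill_mask("زندگی بدون [MASK] خستهکننده است.")[:3]:
    print(prediction["token_str"], prediction["score"])
```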
### Downstream Tasks
Similar to the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks.
Examples on Persian datasets are available in our GitHub repository.
Make sure to use the default fast tokenizer.
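
As a rough, self-contained illustration (not taken from the repository), the sketch below fine-tunes FaBERT for binary sequence classification with the standard `Trainer` API; the toy texts, labels, and output directory are placeholders for a real Persian dataset and configuration:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # fast tokenizer by default
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Placeholder data; substitute a real Persian dataset (e.g. loaded with the `datasets` library)
texts = ["این فیلم عالی بود.", "اصلا از این محصول راضی نبودم."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels in the format expected by Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fabert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```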
## Training Details
FaBERT was pre-trained with the masked language modeling (MLM) objective using whole word masking (WWM), and the resulting perplexity on the validation set was 7.76.
| Hyperparameter | Value |
|----------------|-------|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |

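The whole word masking objective can be illustrated with the generic collator shipped in `transformers`; this is only a sketch of how WWM masks all sub-word pieces of randomly selected words, not the exact pre-training pipeline used for FaBERT:

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Placeholder sample sentence; the collator masks whole words at random
example = tokenizer("این یک جمله نمونه برای نمایش ماسک کردن کل کلمه است.")
batch = collator([example])

print(tokenizer.decode(batch["input_ids"][0]))  # some words replaced by [MASK]
```
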
## Evaluation
Here are some key performance results for the FaBERT model:
### Sentiment Analysis

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| MirasOpinion | 87.51 | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | 75.51 |
| DeepSentiPers | 79.85 | 74.94 | 79.00 |

### Named Entity Recognition

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| PEYMA | 91.39 | 91.24 | 90.91 |
| ParsTwiner | 82.22 | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | 58.09 | 51.47 |

### Question Answering

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| ParsiNLU | 55.87 | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | 87.60 |
| PCoQA | 53.51 | 50.96 | 51.12 |

### Natural Language Inference & QQP

| Task | FaBERT | ParsBERT | XLM-R |
|------|--------|----------|-------|
| FarsTail | 84.45 | 82.52 | 83.50 |
| SBU-NLI | 66.65 | 58.41 | 58.85 |
| ParsiNLU QQP | 82.62 | 77.60 | 79.74 |

### Number of Parameters

|  | FaBERT | ParsBERT | XLM-R |
|--|--------|----------|-------|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |

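As a quick sanity check (a small sketch, not from the original card), the FaBERT column above can be reproduced approximately from the loaded encoder:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("sbunlp/fabert")

# Should print roughly 124M parameters and a 50K-entry vocabulary
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
print(f"{model.config.vocab_size / 1e3:.0f}K vocabulary entries")
```
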
For a more detailed performance analysis, refer to the [paper](https://arxiv.org/abs/2402.06617).
## How to Cite
If you use FaBERT in your research or projects, please cite it using the following BibTeX:
```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```