ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google's BERT architecture, sharing the same configuration as BERT-Base. It aims to enhance Persian language understanding.
The paper presenting ParsBERT can be found at arXiv:2005.12515.
All the models for downstream tasks are uncased and trained with whole word masking. (Coming soon, stay tuned!)
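As a quick orientation, the sketch below shows one way to load a pretrained ParsBERT checkpoint through the Hugging Face transformers library. The checkpoint identifier is an assumption based on the Hooshvare naming on the model hub; substitute whichever ParsBERT checkpoint you intend to use.

```python
# Minimal sketch: loading ParsBERT with Hugging Face transformers.
# The checkpoint name below is an assumption; replace it with the
# ParsBERT checkpoint you actually want to use.
from transformers import AutoModel, AutoTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a sentence and pull out the contextual embeddings.
inputs = tokenizer("ParsBERT is a Persian language model.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```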
Features
Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task focuses on extracting named entities from text, such as person names, and labeling them with the appropriate NER classes, such as locations and organizations. The datasets used for this task contain sentences annotated in the IOB format. In this format, tokens that are not part of an entity are tagged as "O", the "B" tag marks the first word of an entity, and the "I" tag marks the remaining words of the same entity. Both the "B" and "I" tags are followed by a hyphen (or underscore) and then the entity category, as illustrated below. The NER task is therefore a multi-class token classification problem that labels the tokens of raw input text. Two primary datasets are used for Persian NER: ARMAN and PEYMA. In ParsBERT, we prepared NER models for both datasets as well as their combination.
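For concreteness, here is a small illustration of how an IOB-tagged sentence looks as (token, tag) pairs. The sentence and labels are invented for the example and do not come from ARMAN or PEYMA.

```python
# Hypothetical IOB-tagged example (invented for illustration, not taken
# from ARMAN or PEYMA): each token carries an O, B-<class>, or I-<class> tag.
tagged_sentence = [
    ("Sherlock", "B-person"),    # first token of a person entity
    ("Holmes",   "I-person"),    # continuation of the same entity
    ("lives",    "O"),           # not part of any entity
    ("in",       "O"),
    ("London",   "B-location"),  # single-token location entity
    (".",        "O"),
]
```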
ARMAN
The ARMAN dataset contains 7,682 sentences, with 250,015 tokens tagged across six different classes:
- Organization
- Location
- Facility
- Event
- Product
- Person
| Label | # |
|---|---|
| Organization | 30108 |
| Location | 12924 |
| Facility | 4458 |
| Event | 7557 |
| Product | 4389 |
| Person | 15645 |
Download
You can download the dataset from here
Documentation
Results
The following table summarizes the F1 scores obtained by ParsBERT compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|---|---|---|---|---|---|---|
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
How to use :hugs:
| Notebook | Description |
|---|---|
| How to use Pipelines | A simple and efficient way to use state-of-the-art models on downstream tasks through transformers |
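As a rough sketch of what the pipelines notebook covers, the snippet below runs Persian NER with a ParsBERT checkpoint through the transformers pipeline API. The checkpoint identifier and the placeholder input sentence are assumptions for illustration.

```python
# Sketch: Persian NER through the transformers pipeline API.
# The checkpoint name is an assumption; point it at the ParsBERT NER
# model you want (e.g. one trained on ARMAN, PEYMA, or their combination).
from transformers import pipeline

ner = pipeline(
    "ner",
    model="HooshvareLab/bert-base-parsbert-ner-uncased",  # assumed checkpoint id
    aggregation_strategy="simple",  # merge B-/I- word pieces into whole entities
)

for entity in ner("Your Persian sentence goes here."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```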
License
This project is licensed under the Apache-2.0 license.
Cite
Please cite the following paper in your publication if you are using ParsBERT in your research:
@article{ParsBERT,
  title={ParsBERT: Transformer-based Model for Persian Language Understanding},
  author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.12515}
}
Acknowledgments
We would like to express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing the necessary computational resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and for scraping online text resources.
Contributors
- Mehrdad Farahani: LinkedIn, Twitter, GitHub
- Mohammad Gharachorloo: LinkedIn, Twitter, GitHub
- Marzieh Farahani: LinkedIn, Twitter, GitHub
- Mohammad Manthouri: LinkedIn, Twitter, GitHub
- Hooshvare Team: Official Website, LinkedIn, Twitter, GitHub, Instagram
A special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: LinkedIn, Behance, Instagram
Releases
Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!