# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google's BERT architecture, with the same configuration as BERT-Base. It aims to enhance Persian language understanding.

The paper presenting ParsBERT is available at arXiv:2005.12515. All the models (for downstream tasks) are uncased and trained with whole word masking. (coming soon, stay tuned!)
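For quick reference, the pretrained base model can be loaded with the Hugging Face Transformers library. A minimal sketch, assuming the checkpoint is published on the Hugging Face Hub under the ID shown (verify the exact name on the Hub before use):

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hub ID for the pretrained base model; verify on huggingface.co.
MODEL_ID = "HooshvareLab/bert-base-parsbert-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a short Persian sentence and inspect the contextual embeddings.
inputs = tokenizer("سلام دنیا", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size=768 for a BERT-Base config)
```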
## ✨ Features
### Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task focuses on extracting named entities in the text, such as names, and labeling them with appropriate NER classes such as locations, organizations, etc. The datasets used for this task contain sentences marked in the IOB format. In this format, tokens that are not part of an entity are tagged as "O", the "B" tag corresponds to the first word of an entity, and the "I" tag corresponds to the remaining words of the same entity. Both "B" and "I" tags are followed by a hyphen (or underscore) and the entity category, as illustrated in the sketch below. The NER task is therefore a multi-class token classification problem that labels the tokens of a raw input text. There are two primary datasets used for Persian NER, ARMAN and PEYMA. In ParsBERT, we prepared NER for both datasets as well as a combination of them.
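As a concrete illustration, here is a hypothetical IOB-tagged sentence (the tokens and tags are invented for this example, with English tokens used only for readability):

```python
# Hypothetical (token, tag) pairs for one IOB-tagged sentence.
tagged_sentence = [
    ("Mehrdad",   "B-PER"),  # first token of a Person entity
    ("Farahani",  "I-PER"),  # continuation of the same Person entity
    ("works",     "O"),      # not part of any entity
    ("at",        "O"),
    ("Hooshvare", "B-ORG"),  # single-token Organization entity
    ("in",        "O"),
    ("Tehran",    "B-LOC"),  # single-token Location entity
]
```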
#### PEYMA
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, out of which 41,148 tokens are tagged with seven different classes:
- Organization
- Money
- Location
- Date
- Time
- Person
- Percent
| Label | # |
|:---|---:|
| Organization | 16,964 |
| Money | 2,037 |
| Location | 8,782 |
| Date | 4,259 |
| Time | 732 |
| Person | 7,675 |
| Percent | 699 |
Download: You can download the dataset from here.
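The exact file layout of the distributed dataset is not described here, but IOB datasets commonly ship in a CoNLL-style layout: one `token<TAB>tag` pair per line, with a blank line between sentences. A minimal reader, under that assumption:

```python
def read_iob(path):
    """Read a CoNLL/IOB-style file into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line marks a sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file has no trailing blank line
        sentences.append((tokens, tags))
    return sentences
```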
#### Results
The following table summarizes the F1 scores obtained by ParsBERT compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---|---:|---:|---:|---:|---:|---:|
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
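The scores above are entity-level F1. The source does not say which tool computed them, but `seqeval` is a common choice for scoring IOB-tagged predictions and shows how the metric treats whole entities rather than individual tokens:

```python
from seqeval.metrics import f1_score

# Toy gold and predicted tag sequences, one inner list per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["B-ORG", "O"]]

# An entity counts as correct only if its full span and type match:
# 2 of 3 gold entities are predicted, with no false positives.
print(f1_score(y_true, y_pred))  # 0.8
```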
## 💻 Usage Examples

### How to use :hugs:
| Notebook | Description |
|:---|:---|
| How to use Pipelines | A simple and efficient way to use State-of-the-Art models on downstream tasks through transformers |
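A minimal sketch of the pipeline approach for NER, assuming the fine-tuned checkpoint is published on the Hugging Face Hub under the ID shown (verify the exact name before use):

```python
from transformers import pipeline

# Assumed Hub ID for the ParsBERT NER checkpoint; check the Hub for the exact name.
MODEL_ID = "HooshvareLab/bert-base-parsbert-ner-uncased"

ner = pipeline("ner", model=MODEL_ID, tokenizer=MODEL_ID)

text = "مهرداد در تهران زندگی می‌کند."  # "Mehrdad lives in Tehran."
for entity in ner(text):
    print(entity["word"], entity["entity"], entity["score"])
```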
## 📄 License
The project is licensed under the Apache-2.0 license.
## 📝 Cite
Please cite the following paper in your publication if you are using ParsBERT in your research:
    @article{ParsBERT,
      title={ParsBERT: Transformer-based Model for Persian Language Understanding},
      author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
      journal={ArXiv},
      year={2020},
      volume={abs/2005.12515}
    }
## 🙏 Acknowledgments
We express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing us with the necessary computation resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.
## 👥 Contributors
- Mehrdad Farahani: LinkedIn, Twitter, GitHub
- Mohammad Gharachorloo: LinkedIn, Twitter, GitHub
- Marzieh Farahani: LinkedIn, Twitter, GitHub
- Mohammad Manthouri: LinkedIn, Twitter, GitHub
- Hooshvare Team: Official Website, LinkedIn, Twitter, GitHub, Instagram

A special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: LinkedIn, Behance, Instagram.
## 🤖 Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!