# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google's BERT architecture, with the same configuration as BERT-Base. It aims to improve Persian language understanding.

The paper presenting ParsBERT can be found at [arXiv:2005.12515](https://arxiv.org/abs/2005.12515).

All the models (for downstream tasks) are uncased and trained with whole word masking. (Coming soon, stay tuned!)
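As a quick orientation, here is a minimal sketch of loading the base model with the Hugging Face `transformers` library. The checkpoint id `HooshvareLab/bert-base-parsbert-uncased` is an assumption here, not confirmed by this README; substitute the identifier of the actual published release.

```python
# Minimal sketch: load ParsBERT and extract contextual embeddings.
# NOTE: the checkpoint id below is an assumption; replace it with the
# actual published model name.
from transformers import AutoModel, AutoTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "این یک متن نمونه است."  # "This is a sample text."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# BERT-Base configuration -> hidden size 768
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```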
## ✨ Features

### Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task focuses on extracting named entities from text, such as names, and labeling them with appropriate NER classes such as locations, organizations, etc. The datasets used for this task contain sentences marked in the IOB format. In this format, tokens that are not part of an entity are tagged as "O", the "B" tag corresponds to the first word of an entity, and the "I" tag corresponds to the remaining words of the same entity. Both "B" and "I" tags are followed by a hyphen (or underscore) and then the entity category. The NER task is therefore a multi-class token classification problem that labels the tokens of raw text as it is fed in (see the sketch after this paragraph). There are two primary datasets for Persian NER, ARMAN and PEYMA; ParsBERT provides NER models for both datasets as well as their combination.
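To make the IOB scheme concrete, the sketch below groups IOB-tagged tokens into `(entity, class)` spans. The tokens and tags are invented for illustration and are not drawn from ARMAN or PEYMA.

```python
# Illustrative sketch: collapse IOB-tagged tokens into (entity_text, entity_class) spans.
# Tokens and tags below are invented examples, not taken from ARMAN or PEYMA.
def iob_to_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                # first word of a new entity
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current:  # continuation of the current entity
            current[0].append(token)
        else:                                   # "O": token outside any entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), label) for words, label in spans]

tokens = ["John", "lives", "in", "New", "York"]
tags   = ["B-Person", "O", "O", "B-Location", "I-Location"]
print(iob_to_spans(tokens, tags))  # [('John', 'Person'), ('New York', 'Location')]
```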
### PEYMA
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:
- Organization
- Money
- Location
- Date
- Time
- Person
- Percent
| Label | # |
|:---|---:|
| Organization | 16,964 |
| Money | 2,037 |
| Location | 8,782 |
| Date | 4,259 |
| Time | 732 |
| Person | 7,675 |
| Percent | 699 |
Download: You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
### ARMAN

The ARMAN dataset contains 7,682 sentences with 250,015 tokens tagged across six different classes:
- Organization
- Location
- Facility
- Event
- Product
- Person
| Label | # |
|:---|---:|
| Organization | 30,108 |
| Location | 12,924 |
| Facility | 4,458 |
| Event | 7,557 |
| Product | 4,389 |
| Person | 15,645 |
Download: You can download the dataset from here
## Results
The following table summarizes the F1 score obtained by ParsBERT compared to other models and architectures:
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## 💻 Usage Examples

### Basic Usage
| Notebook | Description | |
|:---|:---|:---|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | A simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
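For reference, a minimal sketch of the pipeline approach used in the notebook. The checkpoint id `HooshvareLab/bert-base-parsbert-ner-uncased` is assumed here; substitute whichever published NER model you intend to run.

```python
# Sketch: run Persian NER through the transformers pipeline API.
# NOTE: the model id below is an assumption; replace it with the
# published ParsBERT NER checkpoint you want to use.
from transformers import pipeline

model_name = "HooshvareLab/bert-base-parsbert-ner-uncased"  # assumed checkpoint id
ner = pipeline("ner", model=model_name, tokenizer=model_name)

text = "این یک متن نمونه برای آزمایش است."  # "This is a sample text for testing."
for entity in ner(text):
    print(entity["word"], entity["entity"], round(float(entity["score"]), 4))
```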
## Documentation

### Cite
Please cite the following paper in your publications if you use ParsBERT in your research:

```bibtex
@article{ParsBERT,
  title={ParsBERT: Transformer-based Model for Persian Language Understanding},
  author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.12515}
}
```
### Acknowledgments

We express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing the necessary computation resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.
### Contributors

- Mehrdad Farahani: LinkedIn, Twitter, GitHub
- Mohammad Gharachorloo: [LinkedIn](https://www.linkedin.com/in/mohammad-gharachorloo/), Twitter, GitHub
- Marzieh Farahani: LinkedIn, Twitter, GitHub
- Mohammad Manthouri: [LinkedIn](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), Twitter, GitHub
- Hooshvare Team: Official Website, LinkedIn, Twitter, GitHub, Instagram

A special thanks to Sara Tabrizi for her fantastic poster design. Follow her on [LinkedIn](https://www.linkedin.com/in/sara-tabrizi-64548b79/), Behance, and Instagram.
### Releases

#### Release v0.1 (May 29, 2020)
This is the first version of our ParsBERT NER!
## License

This project is licensed under the Apache 2.0 License.