# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google's BERT architecture, with the same configuration as BERT-Base. It aims to improve Persian language understanding.

The paper presenting ParsBERT can be found at [arXiv:2005.12515](https://arxiv.org/abs/2005.12515).

All the models (for downstream tasks) are uncased and trained with whole word masking. (Coming soon, stay tuned!)
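As a quick orientation, here is a minimal sketch of loading the base model with the Hugging Face `transformers` library. The checkpoint id `HooshvareLab/bert-base-parsbert-uncased` is an assumption here, not confirmed by this README; substitute the identifier of the actual published release.

```python
# Minimal sketch: load ParsBERT and extract contextual embeddings.
# NOTE: the checkpoint id below is an assumption; replace it with the
# actual published model name.
from transformers import AutoModel, AutoTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "این یک متن نمونه است."  # "This is a sample text."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# BERT-Base configuration -> hidden size 768
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```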
## ✨ Features

### Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task focuses on extracting named entities from text, such as names, and labeling them with appropriate NER classes such as locations, organizations, etc. The datasets used for this task contain sentences marked in the IOB format. In this format, tokens that are not part of an entity are tagged as "O", the "B" tag corresponds to the first word of an entity, and the "I" tag corresponds to the remaining words of the same entity. Both "B" and "I" tags are followed by a hyphen (or underscore) and then the entity category. The NER task is therefore a multi-class token classification problem that labels the tokens of raw text as it is fed in (see the sketch after this paragraph). There are two primary datasets for Persian NER, ARMAN and PEYMA; ParsBERT provides NER models for both datasets as well as their combination.
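To make the IOB scheme concrete, the sketch below groups IOB-tagged tokens into `(entity, class)` spans. The tokens and tags are invented for illustration and are not drawn from ARMAN or PEYMA.

```python
# Illustrative sketch: collapse IOB-tagged tokens into (entity_text, entity_class) spans.
# Tokens and tags below are invented examples, not taken from ARMAN or PEYMA.
def iob_to_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                # first word of a new entity
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current:  # continuation of the current entity
            current[0].append(token)
        else:                                   # "O": token outside any entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), label) for words, label in spans]

tokens = ["John", "lives", "in", "New", "York"]
tags   = ["B-Person", "O", "O", "B-Location", "I-Location"]
print(iob_to_spans(tokens, tags))  # [('John', 'Person'), ('New York', 'Location')]
```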
### PEYMA
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:
- Organization
- Money
- Location
- Date
- Time
- Person
- Percent
| Label | # |
|:---|---:|
| Organization | 16,964 |
| Money | 2,037 |
| Location | 8,782 |
| Date | 4,259 |
| Time | 732 |
| Person | 7,675 |
| Percent | 699 |
Download: You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
### ARMAN

The ARMAN dataset contains 7,682 sentences with 250,015 tokens tagged across six different classes:
- Organization
- Location
- Facility
- Event
- Product
- Person
| Label | # |
|:---|---:|
| Organization | 30,108 |
| Location | 12,924 |
| Facility | 4,458 |
| Event | 7,557 |
| Product | 4,389 |
| Person | 15,645 |
Download: You can download the dataset from here
## Results
The following table summarizes the F1 score obtained by ParsBERT compared to other models and architectures:
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ARMAN + PEYMA | 95.13* | - | - | - | - | - |
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
| ARMAN | 93.10* | 89.9 | 84.03 | 86.55 | - | 77.45 |
## 💻 Usage Examples

### Basic Usage
| Notebook | Description | |
|:---|:---|:---|
| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) | A simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
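For reference, a minimal sketch of the pipeline approach used in the notebook. The checkpoint id `HooshvareLab/bert-base-parsbert-ner-uncased` is assumed here; substitute whichever published NER model you intend to run.

```python
# Sketch: run Persian NER through the transformers pipeline API.
# NOTE: the model id below is an assumption; replace it with the
# published ParsBERT NER checkpoint you want to use.
from transformers import pipeline

model_name = "HooshvareLab/bert-base-parsbert-ner-uncased"  # assumed checkpoint id
ner = pipeline("ner", model=model_name, tokenizer=model_name)

text = "این یک متن نمونه برای آزمایش است."  # "This is a sample text for testing."
for entity in ner(text):
    print(entity["word"], entity["entity"], round(float(entity["score"]), 4))
```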
## Documentation

### Cite
Please cite the following paper in your publications if you use ParsBERT in your research:

```bibtex
@article{ParsBERT,
  title={ParsBERT: Transformer-based Model for Persian Language Understanding},
  author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.12515}
}
```
### Acknowledgments

We express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing the necessary computation resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.
### Contributors

- Mehrdad Farahani: LinkedIn, Twitter, GitHub
- Mohammad Gharachorloo: [LinkedIn](https://www.linkedin.com/in/mohammad-gharachorloo/), Twitter, GitHub
- Marzieh Farahani: LinkedIn, Twitter, GitHub
- Mohammad Manthouri: [LinkedIn](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), Twitter, GitHub
- Hooshvare Team: Official Website, LinkedIn, Twitter, GitHub, Instagram

A special thanks to Sara Tabrizi for her fantastic poster design. Follow her on [LinkedIn](https://www.linkedin.com/in/sara-tabrizi-64548b79/), Behance, and Instagram.
### Releases

#### Release v0.1 (May 29, 2020)
This is the first version of our ParsBERT NER!
## License

This project is licensed under the Apache 2.0 License.