# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google's BERT architecture, with the same configuration as BERT-Base. It aims to enhance Persian language understanding.

The paper presenting ParsBERT is available at arXiv:2005.12515. All the models (for downstream tasks) are uncased and trained with whole word masking. (coming soon, stay tuned!)
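For quick reference, the pretrained base model can be loaded with the Hugging Face Transformers library. A minimal sketch, assuming the checkpoint is published on the Hugging Face Hub under the ID shown (verify the exact name on the Hub before use):

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hub ID for the pretrained base model; verify on huggingface.co.
MODEL_ID = "HooshvareLab/bert-base-parsbert-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a short Persian sentence and inspect the contextual embeddings.
inputs = tokenizer("سلام دنیا", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size=768 for a BERT-Base config)
```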
## ✨ Features
### Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
This task focuses on extracting named entities in the text, such as names, and labeling them with appropriate NER classes such as locations, organizations, etc. The datasets used for this task contain sentences marked in the IOB format. In this format, tokens that are not part of an entity are tagged as "O", the "B" tag corresponds to the first word of an entity, and the "I" tag corresponds to the remaining words of the same entity. Both "B" and "I" tags are followed by a hyphen (or underscore) and the entity category, as illustrated in the sketch below. The NER task is therefore a multi-class token classification problem that labels the tokens of a raw input text. There are two primary datasets used for Persian NER, ARMAN and PEYMA. In ParsBERT, we prepared NER for both datasets as well as a combination of them.
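As a concrete illustration, here is a hypothetical IOB-tagged sentence (the tokens and tags are invented for this example, with English tokens used only for readability):

```python
# Hypothetical (token, tag) pairs for one IOB-tagged sentence.
tagged_sentence = [
    ("Mehrdad",   "B-PER"),  # first token of a Person entity
    ("Farahani",  "I-PER"),  # continuation of the same Person entity
    ("works",     "O"),      # not part of any entity
    ("at",        "O"),
    ("Hooshvare", "B-ORG"),  # single-token Organization entity
    ("in",        "O"),
    ("Tehran",    "B-LOC"),  # single-token Location entity
]
```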
#### PEYMA
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, out of which 41,148 tokens are tagged with seven different classes:
- Organization
- Money
- Location
- Date
- Time
- Person
- Percent
| Label | # |
|:---|---:|
| Organization | 16,964 |
| Money | 2,037 |
| Location | 8,782 |
| Date | 4,259 |
| Time | 732 |
| Person | 7,675 |
| Percent | 699 |
Download: You can download the dataset from here.
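The exact file layout of the distributed dataset is not described here, but IOB datasets commonly ship in a CoNLL-style layout: one `token<TAB>tag` pair per line, with a blank line between sentences. A minimal reader, under that assumption:

```python
def read_iob(path):
    """Read a CoNLL/IOB-style file into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line marks a sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file has no trailing blank line
        sentences.append((tokens, tags))
    return sentences
```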
#### Results
The following table summarizes the F1 scores obtained by ParsBERT compared to other models and architectures.
| Dataset | ParsBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:---|---:|---:|---:|---:|---:|---:|
| PEYMA | 98.79* | - | 90.59 | - | 84.00 | - |
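The scores above are entity-level F1. The source does not say which tool computed them, but `seqeval` is a common choice for scoring IOB-tagged predictions and shows how the metric treats whole entities rather than individual tokens:

```python
from seqeval.metrics import f1_score

# Toy gold and predicted tag sequences, one inner list per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["B-ORG", "O"]]

# An entity counts as correct only if its full span and type match:
# 2 of 3 gold entities are predicted, with no false positives.
print(f1_score(y_true, y_pred))  # 0.8
```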
## 💻 Usage Examples

### How to use :hugs:
| Notebook | Description |
|:---|:---|
| How to use Pipelines | A simple and efficient way to use State-of-the-Art models on downstream tasks through transformers |
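A minimal sketch of the pipeline approach for NER, assuming the fine-tuned checkpoint is published on the Hugging Face Hub under the ID shown (verify the exact name before use):

```python
from transformers import pipeline

# Assumed Hub ID for the ParsBERT NER checkpoint; check the Hub for the exact name.
MODEL_ID = "HooshvareLab/bert-base-parsbert-ner-uncased"

ner = pipeline("ner", model=MODEL_ID, tokenizer=MODEL_ID)

text = "مهرداد در تهران زندگی می‌کند."  # "Mehrdad lives in Tehran."
for entity in ner(text):
    print(entity["word"], entity["entity"], entity["score"])
```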
## 📄 License
The project is licensed under the Apache-2.0 license.
## 📝 Cite
Please cite the following paper in your publication if you are using ParsBERT in your research:
    @article{ParsBERT,
      title={ParsBERT: Transformer-based Model for Persian Language Understanding},
      author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
      journal={ArXiv},
      year={2020},
      volume={abs/2005.12515}
    }
## 🙏 Acknowledgments
We express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing us with the necessary computation resources. We also thank the Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.
## 👥 Contributors
- Mehrdad Farahani: LinkedIn, Twitter, GitHub
- Mohammad Gharachorloo: LinkedIn, Twitter, GitHub
- Marzieh Farahani: LinkedIn, Twitter, GitHub
- Mohammad Manthouri: LinkedIn, Twitter, GitHub
- Hooshvare Team: Official Website, LinkedIn, Twitter, GitHub, Instagram

A special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: LinkedIn, Behance, Instagram.
## 🤖 Releases
### Release v0.1 (May 29, 2019)
This is the first version of our ParsBERT NER!