Open-Source Persian Model albert-fa-base-v2-ner-peyma - Provides Strong Support for Persian Language Processing

Albert Fa Base V2 Ner Peyma

Developed by m3hrdadfi

The first ALBERT model specifically for Persian, based on Google's ALBERT base v2.0 architecture, trained on diverse Persian corpora

Large Language Model

Transformers

Other

Open Source License: Apache-2.0

#Persian Language Understanding #Lightweight BERT #Named Entity Recognition

Downloads 19

Release Time: 3/2/2022

Model Overview

A lightweight BERT model for self-supervised language representation learning in Persian, suitable for natural language processing tasks

Model Features

Lightweight Architecture

Based on ALBERT architecture with fewer parameters and higher efficiency compared to standard BERT models

Diverse Training Data

Trained on diverse Persian corpora comprising over 3.9 million documents, 73 million sentences, and 1.3 billion words

Named Entity Recognition Capability

Specially optimized for Persian named entity recognition tasks

Model Capabilities

Persian text understanding

Named Entity Recognition

Token classification

Use Cases

Natural Language Processing

Persian Named Entity Recognition

Identifying entities such as organizations, person names, and locations from Persian text

Achieved an F1 score of 88.99 on the PEYMA dataset

🚀 ALBERT Persian

A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language

ALBERT-Persian is the first attempt to build an ALBERT model for the Persian language. It is based on Google's ALBERT BASE Version 2.0 and was trained on text covering various writing styles and subjects (e.g., scientific, novels, news), comprising more than 3.9M documents, 73M sentences, and 1.3B words, similar to the approach used for ParsBERT.

Please follow the ALBERT-Persian repo for the latest information about previous and current models.

🚀 Quick Start

The project applies ALBERT to the Persian language for self-supervised learning of language representations. Refer to the official ALBERT-Persian repository for the latest model information.
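
The page gives no loading code, so the following is only a minimal sketch of how a token-classification checkpoint like this one is typically loaded with the Hugging Face `transformers` library. The Hub id in `MODEL_ID` is an assumption built from the developer and model names on this page, and `build_ner_pipeline` is a hypothetical helper name. Check both against the ALBERT-Persian repo before relying on them.

```python
# Assumed Hub id, built from the developer name (m3hrdadfi) and the model
# name shown on this page; the page itself does not state it.
MODEL_ID = "m3hrdadfi/albert-fa-base-v2-ner-peyma"


def build_ner_pipeline(model_id: str = MODEL_ID):
    """Return a token-classification pipeline for the given checkpoint.

    `transformers` is imported lazily so that defining this helper does not
    require the library; calling it does, and it downloads the model weights.
    """
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces that belong to the
    # same entity into a single prediction.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )
```

Calling `build_ner_pipeline()` and passing a Persian sentence to the returned object should yield a list of dictionaries, each with an `entity_group`, a `word`, and a `score`. The ALBERT tokenizer may also need the `sentencepiece` package installed.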

✨ Features

Persian NER [ARMAN, PEYMA]

This task extracts named entities from text, such as names, and labels them with the appropriate NER class (location, organization, etc.). The datasets used for this task contain sentences annotated in the IOB format: tokens that are not part of an entity are tagged "O", the "B" tag marks the first word of an entity, and the "I" tag marks the remaining words of the same entity. Both "B" and "I" tags are followed by a hyphen (or underscore) and then the entity category. NER is therefore a multi-class token classification problem that assigns a label to each token of a raw input text. The two primary datasets for Persian NER are ARMAN and PEYMA.
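
To make the IOB scheme concrete, here is a small sketch that groups a tag sequence into entity spans. The function names and the short category codes (`PER`, `ORG`, `LOC`) are illustrative assumptions, not taken from the datasets. The example sentence is in English only for readability.

```python
from typing import List, Tuple

# (entity category, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def split_tag(tag: str) -> Tuple[str, str]:
    """Split an IOB tag such as 'B-ORG' or 'I_ORG' into (prefix, category)."""
    if tag == "O":
        return "O", ""
    # The prefix is one character, followed by a one-character separator
    # (hyphen or underscore), followed by the entity category.
    return tag[0], tag[2:]


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Group a sequence of IOB tags into entity spans."""
    spans: List[Span] = []
    start, current = None, ""
    for i, tag in enumerate(tags):
        prefix, category = split_tag(tag)
        continues = start is not None and prefix == "I" and category == current
        if continues:
            continue
        # The open entity (if any) ends here.
        if start is not None:
            spans.append((current, start, i))
            start, current = None, ""
        # "B" opens a new entity; a stray "I" is treated the same way.
        if prefix in ("B", "I"):
            start, current = i, category
    if start is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["Ali", "works", "at", "Sharif", "University", "in", "Tehran"]
tags = ["B_PER", "O", "O", "B_ORG", "I_ORG", "O", "B_LOC"]
for category, begin, end in iob_to_spans(tags):
    print(category, " ".join(tokens[begin:end]))
```

Treating a stray "I" tag as the start of an entity is one possible design choice; a stricter decoder could discard such tags instead.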

PEYMA

The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:

Organization
Money
Location
Date
Time
Person
Percent

Label	Tagged tokens
Organization	16964
Money	2037
Location	8782
Date	4259
Time	732
Person	7675
Percent	699
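
As a quick consistency check, the per-class counts can be summed and compared with the totals quoted for PEYMA. The sketch below uses only the numbers from this page; the variable names are illustrative.

```python
# Figures copied from the PEYMA description on this page.
PEYMA_TOTAL_TOKENS = 302_530
label_counts = {
    "Organization": 16964,
    "Money": 2037,
    "Location": 8782,
    "Date": 4259,
    "Time": 732,
    "Person": 7675,
    "Percent": 699,
}

# Total number of tagged tokens and their share of the whole corpus.
tagged = sum(label_counts.values())
share_of_corpus = tagged / PEYMA_TOTAL_TOKENS

print(f"tagged tokens: {tagged} ({share_of_corpus:.1%} of all tokens)")
# Per-class share of the tagged tokens, largest class first.
for label, count in sorted(label_counts.items(), key=lambda kv: -kv[1]):
    print(f"{label:<12} {count:>6} {count / tagged:6.1%}")
```

The counts sum to 41,148, which matches the stated number of tagged tokens and is roughly 13.6% of the corpus. The classes are also clearly imbalanced: Organization alone accounts for about two fifths of the tagged tokens.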

Download: You can download the dataset from here.

📚 Documentation

Results

The following table compares the F1 score obtained by ALBERT-fa-base-v2 on PEYMA with the scores of other models and architectures.

Dataset	ALBERT-fa-base-v2	ParsBERT-v1	mBERT	MorphoBERT	Beheshti-NER	LSTM-CRF	Rule-Based CRF	BiLSTM-CRF
PEYMA	88.99	93.10	86.64	-	90.59	-	84.00	-
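
The page does not say how the F1 scores were computed. A common convention for NER is entity-level F1 with exact span matching, sketched below as an assumption rather than as the evaluation actually used for this table. The function name and the example spans are illustrative.

```python
from typing import Set, Tuple

# (entity category, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1: a prediction counts only if category and span match exactly."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {("PER", 0, 1), ("ORG", 3, 5), ("LOC", 6, 7)}
# The ORG prediction covers only part of the gold span, so it does not count.
predicted = {("PER", 0, 1), ("ORG", 3, 4), ("LOC", 6, 7)}
score = entity_f1(gold, predicted)
print(f"F1 = {score:.4f}")
```

Under this convention a partially correct span earns no credit, so token-level scores for the same predictions would usually be higher.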

BibTeX entry and citation info

Please cite in publications as the following:

@misc{ALBERTPersian,
  author = {Mehrdad Farahani},
  title = {ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/m3hrdadfi/albert-persian}},
}

@article{ParsBERT,
    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
    journal={ArXiv},
    year={2020},
    volume={abs/2005.12515}
}

📄 License

This project is licensed under the Apache-2.0 license.

💡 Usage Tip

If you have any questions, post a GitHub issue on the ALBERT-Persian repo.
