flair-arabic-multi-ner Open Source Model - Freely Identify Entities Such as Persons, Locations, and Institutions in Arabic

Flair Arabic Multi Ner

Developed by megantosh

This is an Arabic named entity recognition model based on Flair and GloVe embeddings, supporting entity types such as locations, organizations, and persons.

Sequence Labeling

PyTorch

Supports Multiple Languages

Open Source License: Apache-2.0

#Arabic NER #Multi-embedding Fusion #High F1 Score

Downloads 739

Release Time: 3/2/2022

Model Overview

This model is specifically designed for named entity recognition tasks in Arabic text, combining Flair's bidirectional language model with GloVe word embeddings, trained on the AQMAR and ANERcorp datasets.

Model Features

Multi-embedding Combination

Combines GloVe static word embeddings with Flair's context-aware bidirectional language model embeddings

Arabic Language Optimization

Specifically optimized for Arabic's right-to-left writing characteristics

Multi-entity Type Recognition

Can identify four entity types: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC)

Model Capabilities

Arabic Text Analysis

Named Entity Recognition

Sequence Labeling

Use Cases

Text Analysis

News Text Entity Extraction

Extracts person names, place names, and organization names from Arabic news

Reaches a micro-averaged F1 score of 0.8666

Social Media Analysis

Identifies key entities in Arabic social media text

🚀 Arabic NER Model using Flair Embeddings

This project presents an Arabic Named Entity Recognition (NER) model. It was trained for 94 epochs with a batch size of 32 and a linearly decaying learning rate that starts at 0.225 and ends at 2e-05. GloVe embeddings and Flair forward and backward embeddings were used in training.

🚀 Quick Start

Prerequisites

The example below uses the flair, pyarabic, and icecream libraries. Install any that are missing with pip.

Example Code

from flair.data import Sentence
from flair.models import SequenceTagger
import pyarabic.araby as araby
from icecream import ic

tagger = SequenceTagger.load("julien-c/flair-ner")
arTagger = SequenceTagger.load('megantosh/flair-arabic-multi-ner')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')

# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

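# print sentence with predicted tags
# (to_tagged_string is passed without parentheses here, so ic prints the bound
# method's repr, which embeds the tagged sentence; call to_tagged_string()
# instead to get the tagged string itself)
ic(sentence.to_tagged_string)
ic(arSentence.to_tagged_string)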

✨ Features

Multilingual Support: Supports both Arabic and English languages.
High Performance: Achieves a micro-averaged F1-score of 0.8666 and a macro-averaged F1-score of 0.8488.
Multiple Embeddings: Utilizes GloVe and Flair forward and backward embeddings for better performance.

📦 Installation

Install the required Python libraries with the following command:

pip install flair pyarabic icecream

💻 Usage Examples

Basic Usage

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("julien-c/flair-ner")
arTagger = SequenceTagger.load('megantosh/flair-arabic-multi-ner')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')

tagger.predict(sentence)
arTagger.predict(arSentence)

print(sentence.to_tagged_string())
print(arSentence.to_tagged_string())

Advanced Usage

Advanced usage can involve customizing the model configuration or using different embeddings. The configuration of the trained model, shown as its printed module structure, is:

SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=15, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)
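
The layer sizes in the printed structure above can be cross-checked with a little arithmetic. The sketch below only uses numbers that appear in that printout. The 100-dimensional size of the GloVe word embeddings is not printed there, so it is inferred here from the other values and should be read as an assumption.

```python
# Sizes copied from the printed SequenceTagger structure.
flair_lm_hidden = 2048   # (rnn): LSTM(100, 2048) inside each FlairEmbeddings
stacked_dim = 4196       # in_features of (embedding2nn)
tagger_hidden = 256      # (rnn): LSTM(4196, 256, bidirectional=True)
linear_in = 512          # in_features of (linear)

# Assumption: the stacked embedding is the concatenation of one word
# embedding and the two Flair language-model embeddings.
word_embedding_dim = stacked_dim - 2 * flair_lm_hidden

# A bidirectional LSTM concatenates the forward and backward hidden states.
bidirectional_out = 2 * tagger_hidden

print(word_embedding_dim)              # inferred GloVe dimension
print(bidirectional_out == linear_in)  # True if the sizes are consistent
```

The inferred word-embedding size and the match between the bidirectional LSTM output and the final linear layer suggest the printout is internally consistent.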

📚 Documentation

Original Datasets

AQMAR
ANERcorp

Results

F1-score (micro): 0.8666
F1-score (macro): 0.8488

Tag	Named Entity Type	True Positives	False Positives	False Negatives	Precision	Recall	Class F1
LOC	Location	539	51	68	0.9136	0.8880	0.9006
MISC	Miscellaneous	408	57	89	0.8774	0.8209	0.8482
ORG	Organisation	167	43	64	0.7952	0.7229	0.7574
PER	Person (no title)	501	65	60	0.8852	0.8930	0.8891
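
The precision, recall, and F1 columns, as well as the micro and macro averages, follow from the true-positive, false-positive, and false-negative counts in the table. The following is a minimal sketch of that calculation using the standard definitions. The variable and function names are illustrative and are not taken from the model's evaluation code.

```python
# (true positives, false positives, false negatives) per entity type,
# copied from the results table.
counts = {
    "LOC": (539, 51, 68),
    "MISC": (408, 57, 89),
    "ORG": (167, 43, 64),
    "PER": (501, 65, 60),
}


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_class = {tag: precision_recall_f1(*c) for tag, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

# Micro F1: pool the counts over all classes, then compute F1 once.
total_tp, total_fp, total_fn = (
    sum(c[i] for c in counts.values()) for i in range(3)
)
_, _, micro_f1 = precision_recall_f1(total_tp, total_fp, total_fn)

for tag, (p, r, f1) in per_class.items():
    print(f"{tag}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print(f"micro F1 = {micro_f1:.4f}, macro F1 = {macro_f1:.4f}")
```

Rounded to four decimals, these values agree with the figures reported in the table and in the two summary scores.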

Example Output

2021-07-07 14:30:59,649 loading file /Users/mega/.flair/models/flair-ner/f22eb997f66ae2eacad974121069abaefca5fe85fce71b49e527420ff45b9283.941c7c30b38aef8d8a4eb5c1b6dd7fe8583ff723fef457382589ad6a4e859cfc
2021-07-07 14:31:04,654 loading file /Users/mega/.flair/models/flair-arabic-multi-ner/c7af7ddef4fdcc681fcbe1f37719348afd2862b12aa1cfd4f3b93bd2d77282c7.242d030cb106124f7f9f6a88fb9af8e390f581d42eeca013367a86d585ee6dd6
ic| sentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "George Washington went to Washington ."   [− Tokens: 6  − Token-Labels: "George <B-PER> Washington <E-PER> went to Washington <S-LOC> ."]>
ic| arSentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة ."   [− Tokens: 11  − Token-Labels: "عمرو <B-PER> عادلي <I-PER> أستاذ للاقتصاد السياسي المساعد في الجامعة <B-ORG> الأمريكية <I-ORG> بالقاهرة <B-LOC> ."]>
ic| entity: <PER-span (1,2): "George Washington">
ic| entity: <LOC-span (5): "Washington">
ic| entity: <PER-span (1,2): "عمرو عادلي">
ic| entity: <ORG-span (8,9): "الجامعة الأمريكية">
ic| entity: <LOC-span (10): "بالقاهرة">
ic| sentence.to_dict(tag_type='ner'):
{"text":"عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .",
 "labels":[],
 "entities":[{"text":"عمرو عادلي",
              "start_pos":0,
              "end_pos":10,
              "labels":[PER (0.9826)]},
             {"text":"الجامعة الأمريكية",
              "start_pos":45,
              "end_pos":62,
              "labels":[ORG (0.7679)]},
             {"text":"بالقاهرة",
              "start_pos":64,
              "end_pos":72,
              "labels":[LOC (0.8079)]}]}
{"text":"George Washington went to Washington .",
 "labels":[],
 "entities":[{"text":"George Washington",
              "start_pos":0,
              "end_pos":17,
              "labels":[PER (0.9968)]},
             {"text":"Washington",
              "start_pos":26,
              "end_pos":36,
              "labels":[LOC (0.9994)]}]}

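The Token-Labels strings in the example output carry per-token tags with prefixes such as B-, I-, E-, and S-, while the entity lines list whole spans. The sketch below shows one way such per-token tags could be grouped into spans in plain Python. It is an illustration only, not Flair's own decoding logic, and the function name `decode_spans` and the use of "O" for untagged tokens are assumptions made for this example.

```python
def decode_spans(tagged_tokens):
    """Group (token, tag) pairs into (entity_type, text) spans.

    Tags are expected to look like "B-PER", "I-PER", "E-PER", "S-LOC",
    with "O" marking tokens outside any entity.
    """
    spans = []
    current_tokens = []
    current_type = None

    def flush():
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_tokens, current_type = [], None

    for token, tag in tagged_tokens:
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("-")
        # A B- or S- prefix, or a change of entity type, starts a new span.
        if prefix in ("B", "S") or label != current_type:
            flush()
        current_tokens.append(token)
        current_type = label
        # An E- or S- prefix closes the span immediately.
        if prefix in ("E", "S"):
            flush()
    flush()
    return spans


english = [
    ("George", "B-PER"), ("Washington", "E-PER"), ("went", "O"),
    ("to", "O"), ("Washington", "S-LOC"), (".", "O"),
]
print(decode_spans(english))
```

Applied to the English token labels from the example output, this groups the tokens into one PER span and one LOC span, matching the entity lines shown there.
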
🔧 Technical Details

The model was trained for 94 epochs with a linearly decaying learning rate that starts at 0.225 and ends at 2e-05. A batch size of 32 was used, along with GloVe embeddings and Flair forward and backward embeddings.
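
To make the learning-rate description concrete, here is a minimal sketch of a straight-line decay between the two stated endpoints over 94 epochs. The exact per-epoch schedule used in training is not given in this document, so the interpolation below, the 0-based epoch numbering, and the function name `linear_lr` are all assumptions for illustration.

```python
def linear_lr(epoch, total_epochs=94, start=0.225, end=2e-05):
    """Learning rate at a 0-based epoch under straight-line decay.

    Assumption: the rate equals `start` at the first epoch and `end`
    at the last epoch, changing by a constant amount in between.
    """
    if total_epochs < 2:
        return start
    t = epoch / (total_epochs - 1)  # progress from 0.0 to 1.0
    return start * (1 - t) + end * t


# Print the assumed rate at the first, middle, and last epoch.
for epoch in (0, 46, 93):
    print(epoch, linear_lr(epoch))
```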

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you use this model, please consider citing this work:

@unpublished{MMHU21,
  author = "M. Megahed",
  title = "Sequence Labeling Architectures in Diglossia",
  year = {2021},
  doi = "10.13140/RG.2.2.34961.10084",
  url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
}

⚠️ Important Note

Because right-to-left text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like this (link accessed on 2020-10-27).
