BERT-base-Arabic Open-source Language Model - Supports Text Processing of Standard Arabic and Some Dialects

Bert Base Arabic

Developed by asafaya

Pretrained Arabic BERT base language model supporting Modern Standard Arabic and some dialects

Large Language Model | Arabic | #Arabic Pretraining #BERT Architecture #Social Media Analysis

Downloads 14.40k

Release Time : 3/2/2022

Model Overview

This model is a BERT-based pretrained Arabic language model suitable for various Arabic natural language processing tasks.

Model Features

Large-scale Pretraining Data

Trained on approximately 8.2 billion words of Arabic corpus, including OSCAR and Wikipedia data

Dialect Support

Supports not only Modern Standard Arabic but also includes some Arabic dialect content

TPU-optimized Training

Trained for 3 million steps using Google TPU v3-8, optimizing training efficiency

Model Capabilities

Text Understanding

Masked Token Prediction (Fill-Mask)

Named Entity Recognition

Text Classification

Use Cases

Social Media Analysis

Offensive Speech Detection

Used to identify offensive Arabic content on social media

Achieved good performance in SemEval-2020 Task 12

Information Extraction

Arabic NER

Used for Arabic named entity recognition tasks

🚀 Arabic BERT Model

This is a pretrained BERT base language model for Arabic, suitable for a range of Arabic natural language processing tasks.

🚀 Quick Start

To use this model, install torch or tensorflow and the Hugging Face transformers library, then load it directly like this:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download (or load from the local cache) the tokenizer and the masked-LM weights
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-base-arabic")

✨ Features

Pretrained on a large-scale Arabic corpus, which helps it capture the semantics of Arabic.
Can be used for natural language processing tasks such as text classification and named entity recognition.

📦 Installation

To use this model, install torch or tensorflow together with the Hugging Face transformers library.

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download (or load from the local cache) the tokenizer and the masked-LM weights
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-base-arabic")
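
Because the checkpoint is loaded as a masked language model, one natural way to try it is to hide a word and ask for likely replacements. The following is a minimal sketch, not the authors' own example. It assumes the tokenizer uses the standard BERT mask token `[MASK]`; the helper names `mask_word` and `top_predictions` are hypothetical, and the `transformers` import is deferred so that the string helper can be used without the library installed.

```python
MASK_TOKEN = "[MASK]"  # assumption: the standard BERT mask token


def mask_word(sentence: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(masked_sentence: str,
                    model_name: str = "asafaya/bert-base-arabic",
                    top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    Downloads the model on first use, so this needs network access
    and an installed `transformers` plus torch or tensorflow.
    """
    from transformers import pipeline  # imported lazily on purpose

    fill_mask = pipeline("fill-mask", model=model_name)
    results = fill_mask(masked_sentence, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]


# Hide the last word of "The capital of France is Paris"
masked = mask_word("عاصمة فرنسا هي باريس", "باريس")
print(masked)
```

Calling `top_predictions(masked)` then returns the model's highest-scoring candidates for the hidden word; the actual tokens and scores depend on the downloaded checkpoint.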

📚 Documentation

Pretraining Corpus

The arabic-bert-base model was pretrained on ~8.2 billion words:

Arabic version of OSCAR, filtered from Common Crawl
Recent dump of Arabic Wikipedia

and other Arabic resources, which sum up to ~95GB of text.

Notes on training data:

Our final version of the corpus contains some non-Arabic words inline, which we did not remove from sentences since that would affect some tasks like NER.
Non-Arabic characters were lowercased as a preprocessing step. Since Arabic characters do not have upper or lower case, there are no separate cased and uncased versions of the model.
The corpus and vocabulary set are not restricted to Modern Standard Arabic; they contain some dialectal Arabic too.
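
The lowercasing note above can be illustrated with a short sketch. The authors' actual preprocessing script is not shown in this document, so the function below is an assumption that simply relies on Python's built-in `str.lower()`; the name `lowercase_non_arabic` and the sample sentence are hypothetical.

```python
def lowercase_non_arabic(text: str) -> str:
    """Lowercase every cased character in `text`.

    Arabic letters have no upper or lower case, so str.lower()
    leaves them unchanged and only Latin (or other cased) text is affected.
    """
    return text.lower()


# Mixed Arabic/Latin sentence: "Google released the BERT model in 2018"
sample = "أطلقت Google نموذج BERT عام 2018"
normalized = lowercase_non_arabic(sample)
print(normalized)
```

Applying the same normalization to input text before tokenization keeps inline Latin words consistent with how they appeared during pretraining.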

Pretraining details

This model was trained using Google BERT's GitHub repository on a single TPU v3-8 provided for free by TFRC.
Our pretraining procedure follows the training settings of BERT with some changes: trained for 3M training steps with a batch size of 128, instead of 1M steps with a batch size of 256.
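
To put the two schedules side by side, the number of training sequences processed is the step count multiplied by the batch size. The sketch below is only this arithmetic, using the figures quoted above; it ignores sequence length and any other differences between the runs, and the function name `sequences_seen` is hypothetical.

```python
def sequences_seen(steps: int, batch_size: int) -> int:
    """Total training sequences processed: one batch per step."""
    return steps * batch_size


this_model = sequences_seen(3_000_000, 128)  # schedule used for this model
reference = sequences_seen(1_000_000, 256)   # reference BERT schedule quoted above
ratio = this_model / reference

print(f"this model: {this_model:,} sequences")
print(f"reference:  {reference:,} sequences")
print(f"ratio:      {ratio}")
```

Under these assumptions, the smaller batch size is more than offset by the longer schedule: 384 million sequences against 256 million, a factor of 1.5.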

Results

For further details on the model's performance or any other queries, please refer to [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT)

🔧 Technical Details

The model is based on the BERT architecture, a widely used pretrained language model architecture.
The training data comes from multiple Arabic resources (OSCAR, Wikipedia, and others), which adds diversity to the corpus.
Training was carried out on a single TPU v3-8, which sped up the pretraining process.

📄 Citation

If you use this model in your work, please cite this paper:

@inproceedings{safaya-etal-2020-kuisail,
    title = "{KUISAIL} at {S}em{E}val-2020 Task 12: {BERT}-{CNN} for Offensive Speech Identification in Social Media",
    author = "Safaya, Ali  and
      Abdullatif, Moutasem  and
      Yuret, Deniz",
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    month = dec,
    year = "2020",
    address = "Barcelona (online)",
    publisher = "International Committee for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.semeval-1.271",
    pages = "2054--2059",
}

Acknowledgement

Thanks to Google for providing a free TPU for the training process and to Hugging Face for hosting this model on their servers 😊
