Arabic BERT Mini Model
A pre-trained BERT Mini language model for Arabic, intended to support natural language processing tasks in Arabic.
If you use this model in your work, please cite this paper:
@inproceedings{safaya-etal-2020-kuisail,
title = "{KUISAIL} at {S}em{E}val-2020 Task 12: {BERT}-{CNN} for Offensive Speech Identification in Social Media",
author = "Safaya, Ali and
Abdullatif, Moutasem and
Yuret, Deniz",
booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
month = dec,
year = "2020",
address = "Barcelona (online)",
publisher = "International Committee for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.semeval-1.271",
pages = "2054--2059",
}
Documentation
Pretraining Corpus
The arabic-bert-mini model was pre-trained on approximately 8.2 billion words, sourced from:
- the Arabic version of OSCAR, filtered from Common Crawl
- a recent dump of Arabic Wikipedia
It also incorporates other Arabic resources, totaling around 95GB of text.
Notes on training data:
- The final version of the corpus contains some non-Arabic words within sentences. These were not removed, because doing so could hurt tasks such as Named Entity Recognition (NER).
- Non-Arabic characters were lowercased as a preprocessing step. Since Arabic script has no case distinction, there is no separate cased and uncased version of the model (see the tokenizer sketch below).
- The corpus and vocabulary set are not limited to Modern Standard Arabic; they also include some dialectal Arabic.
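As a quick illustration of the lowercasing note above, the snippet below tokenizes a mixed Arabic/Latin string. The exact subword splits depend on the released vocabulary, so treat the printed output as indicative rather than guaranteed:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")

# Latin characters are expected to come out lowercased ("bert", ...),
# while Arabic characters are unaffected since the script has no case.
print(tokenizer.tokenize("نموذج BERT للغة العربية"))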
Pretraining Details
- This model was trained using Google BERT's GitHub repository on a single TPU v3-8, provided free of charge by TFRC.
- The pre-training procedure follows the training settings of BERT with some modifications: it was trained for 3 million training steps with a batch size of 128, instead of 1 million steps with a batch size of 256.
Usage Examples
Basic Usage
You can use this model by installing torch or tensorflow and the Huggingface transformers library. Initialize it as follows:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-mini-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-mini-arabic")
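Once loaded, the model can be exercised end to end. The short sketch below uses the transformers fill-mask pipeline to predict a masked word; the example sentence is only illustrative, and the predictions depend on the model itself:

from transformers import pipeline

# Build a fill-mask pipeline on top of the same checkpoint
fill_mask = pipeline("fill-mask", model="asafaya/bert-mini-arabic")

# Predict the masked token; BERT models use [MASK] as the mask token
for prediction in fill_mask("اللغة العربية [MASK]."):
    print(prediction["token_str"], prediction["score"])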
Installation
To use the model, you need to install the following libraries:
- torch or tensorflow
- the Huggingface transformers library
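For a typical PyTorch setup, an install along these lines should be enough (package names are the standard PyPI ones):

pip install torch transformers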
License
No license information provided in the original document.
Technical Details
Model Performance
For further details on the model's performance or any other queries, please refer to Arabic-BERT.
Acknowledgement
Thanks to Google for providing a free TPU for the training process and to Huggingface for hosting this model on their servers.