🚀 ALBERT-Persian
A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language
ALBERT-Persian is a model for self-supervised learning of Persian language representations. It offers a lightweight alternative to BERT-style models for a variety of natural language processing tasks in Persian.
🚀 Quick Start
Prerequisites
To use any ALBERT model, you need to install sentencepiece. Run the following command in your notebook:
!pip install -q sentencepiece
Usage in TensorFlow 2.0
from transformers import AutoConfig, AutoTokenizer, TFAutoModel
config = AutoConfig.from_pretrained("m3hrdadfi/albert-fa-base-v2")
tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
model = TFAutoModel.from_pretrained("m3hrdadfi/albert-fa-base-v2")
text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد میتوانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)
>>> ['▁ما', '▁در', '▁هوش', 'واره', '▁معتقد', 'یم', '▁با', '▁انتقال', '▁صحیح', '▁دانش', '▁و', '▁اگاه', 'ی', '،', '▁همه', '▁افراد', '▁می', '▁توانند', '▁از', '▁ابزارهای', '▁هوشمند', '▁استفاده', '▁کنند', '.', '▁شعار', '▁ما', '▁هوش', '▁مصنوعی', '▁برای', '▁همه', '▁است', '.']
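A minimal sketch of running the tokenized text through the TF model loaded above (not part of the original card; attribute names follow the standard transformers TF API, and the printed shape is illustrative):

```python
# Encode the same text and run a forward pass (a minimal sketch);
# last_hidden_state holds the contextual token embeddings.
inputs = tokenizer(text, return_tensors="tf")
outputs = model(inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```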
Usage in PyTorch
from transformers import AutoConfig, AutoTokenizer, AutoModel
config = AutoConfig.from_pretrained("m3hrdadfi/albert-fa-base-v2")
tokenizer = AutoTokenizer.from_pretrained("m3hrdadfi/albert-fa-base-v2")
model = AutoModel.from_pretrained("m3hrdadfi/albert-fa-base-v2")
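As with the TensorFlow example, a minimal forward-pass sketch with the PyTorch model loaded above (the example sentence is illustrative):

```python
import torch

# A minimal sketch of extracting contextual embeddings with the PyTorch model above.
text = "هوش مصنوعی برای همه است."  # "AI is for everyone."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, hidden_size])
```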
✨ Features
- Massive Training Data: ALBERT-Persian was trained on a large collection of public corpora, including Persian Wikidumps and MirasText, plus six manually crawled text sources from a range of websites: BigBang Page (scientific), Chetor (lifestyle), Eligasht (itinerary), Digikala (digital magazine), Ted Talks (general conversational), and books (novels, storybooks, and short stories from old to contemporary eras).
- Multiple Downstream Tasks: The raw model can be used for masked language modeling and sentence order prediction, and it is suited to fine-tuning on downstream tasks such as sentiment analysis, text classification, and named entity recognition (see the sketch after this list).
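The following is a rough sketch of the fine-tuning workflow mentioned above; the label count, the toy sentences, and their meaning are placeholders for illustration, not part of this release:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the base checkpoint with a freshly initialized classification head
# (num_labels=2 is a placeholder for a binary sentiment task).
model_name = "m3hrdadfi/albert-fa-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A toy batch; in practice this would come from a labeled Persian dataset.
texts = ["این محصول عالی بود", "کیفیت بسیار بد بود"]  # "This product was great", "The quality was very bad"
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # (2, num_labels)
```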
📚 Documentation
Introduction
ALBERT-Persian is the first attempt at ALBERT for the Persian language. The model was trained based on Google's ALBERT BASE Version 2.0 over various writing styles from numerous subjects (e.g., scientific, novels, news), with more than 3.9M documents, 73M sentences, and 1.3B words, similar to the approach used for ParsBERT.
Please follow the ALBERT-Persian repo for the latest information about previous and current models.
Intended uses & limitations
You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
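For masked language modeling with the raw checkpoint, a minimal sketch using the fill-mask pipeline follows (the example sentence is illustrative; [MASK] is the mask token used by the ALBERT tokenizer):

```python
from transformers import pipeline

# A minimal sketch of masked language modeling with the raw checkpoint.
fill_mask = pipeline("fill-mask", model="m3hrdadfi/albert-fa-base-v2")

# Predict the masked token; the sentence reads "AI is for [MASK]."
for prediction in fill_mask("هوش مصنوعی برای [MASK] است."):
    print(prediction["token_str"], round(prediction["score"], 4))
```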
Training
The training objective metrics after 140k steps were as follows:
***** Eval results *****
global_step = 140000
loss = 2.0080082
masked_lm_accuracy = 0.6141017
masked_lm_loss = 1.9963315
sentence_order_accuracy = 0.985
sentence_order_loss = 0.06908702
Derivative models
Base Config
Eval results
The following tables summarize the F1 scores obtained by ALBERT-Persian compared to other models and architectures.
Sentiment Analysis (SA) Task
| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | DeepSentiPers |
|---|---|---|---|---|
| Digikala User Comments | 81.12 | 81.74 | 80.74 | - |
| SnappFood User Comments | 85.79 | 88.12 | 87.87 | - |
| SentiPers (Multi Class) | 66.12 | 71.11 | - | 69.33 |
| SentiPers (Binary Class) | 91.09 | 92.13 | - | 91.98 |
Text Classification (TC) Task
| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT |
|---|---|---|---|
| Digikala Magazine | 92.33 | 93.59 | 90.72 |
| Persian News | 97.01 | 97.19 | 95.79 |
Named Entity Recognition (NER) Task
| Dataset | ALBERT-fa-base-v2 | ParsBERT-v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|---|---|---|---|---|---|---|---|---|
| PEYMA | 88.99 | 93.10 | 86.64 | - | 90.59 | - | 84.00 | - |
| ARMAN | 97.43 | 98.79 | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
BibTeX entry and citation info
Please cite in publications as follows:
@misc{ALBERT-Persian,
author = {Mehrdad Farahani},
title = {ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/m3hrdadfi/albert-persian}},
}
@article{ParsBERT,
title={ParsBERT: Transformer-based Model for Persian Language Understanding},
author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
journal={ArXiv},
year={2020},
volume={abs/2005.12515}
}
📄 License
This project is licensed under the Apache 2.0 license.
Questions?
Post a GitHub issue on the ALBERT-Persian repo.