🚀 TookaBERT Model
TookaBERT is a family of encoder models for Persian language processing, available in two sizes, base and large, and pre-trained on over 500GB of diverse Persian data.
🚀 Quick Start
Model Details
TookaBERT models are encoder models trained on Persian in two sizes: base and large. They were pre-trained on over 500GB of Persian data spanning a wide variety of topics such as news, blogs, forums, and books, using the masked language modeling objective with whole-word masking (MLM-WWM) and two context lengths.
For more information, you can read our paper on arXiv.
✨ Features
- Trained on a large corpus of Persian data (over 500GB).
- Available in two sizes: base and large.
- Pre-trained with the MLM (WWM) objective.
💻 Usage Examples
Basic Usage
You can use this model directly for masked language modeling with the code below.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Base")
model = AutoModelForMaskedLM.from_pretrained("PartAI/TookaBERT-Base")

# "The city of Berlin is located in <mask>."
text = "شهر برلین در کشور <mask> واقع شده است."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
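
The forward pass above returns raw logits rather than a filled-in sentence. Continuing that example, the following minimal sketch decodes the highest-scoring candidate at the masked position (it assumes only the tokenizer's standard mask_token_id attribute):

# Locate the <mask> position and decode the top-scoring prediction.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = output.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))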
Advanced Usage
It is also possible to use an inference pipeline, as shown below.
from transformers import pipeline

# The fill-mask pipeline handles tokenization, the forward pass, and decoding.
inference_pipeline = pipeline('fill-mask', model="PartAI/TookaBERT-Base")
inference_pipeline("شهر برلین در کشور <mask> واقع شده است.")
You can also fine-tune this model on your own dataset to adapt it to a downstream task, for example on datasets such as the following (a minimal fine-tuning sketch follows the list):
- DeepSentiPers (Sentiment Analysis)

- ParsiNLU-Multiple-choice (Multiple-choice)
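
As an illustration only, here is a minimal fine-tuning sketch for a text-classification task using the Hugging Face Trainer. The CSV file names, the "text" and "label" column names, the number of labels, and the hyperparameters are placeholders to adapt to your own data; they are not the settings behind the results reported below.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Placeholder dataset: CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Base")
model = AutoModelForSequenceClassification.from_pretrained("PartAI/TookaBERT-Base", num_labels=3)

def tokenize(batch):
    # Truncate long texts to the model's maximum context length.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tookabert-finetuned", learning_rate=2e-5,
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()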

📚 Documentation
Evaluation
TookaBERT models were evaluated on a wide range of downstream NLP tasks, such as Sentiment Analysis (SA), Text Classification, Multiple-choice, Question Answering, and Named Entity Recognition (NER).
Here are some key performance results:
| Model name | DeepSentiPers (f1/acc) | MultiCoNER-v2 (f1/acc) | PQuAD (best_exact/best_f1/HasAns_exact/HasAns_f1) | FarsTail (f1/acc) | ParsiNLU-Multiple-choice (f1/acc) | ParsiNLU-Reading-comprehension (exact/f1) | ParsiNLU-QQP (f1/acc) |
|---|---|---|---|---|---|---|---|
| TookaBERT-large | 85.66/85.78 | 69.69/94.07 | 75.56/88.06/70.24/87.83 | 89.71/89.72 | 36.13/35.97 | 33.6/60.5 | 82.72/82.63 |
| TookaBERT-base | 83.93/83.93 | 66.23/93.3 | 73.18/85.71/68.29/85.94 | 83.26/83.41 | 33.6/33.81 | 20.8/42.52 | 81.33/81.29 |
| Shiraz | 81.17/81.08 | 59.1/92.83 | 65.96/81.25/59.63/81.31 | 77.76/77.75 | 34.73/34.53 | 17.6/39.61 | 79.68/79.51 |
| ParsBERT | 80.22/80.23 | 64.91/93.23 | 71.41/84.21/66.29/84.57 | 80.89/80.94 | 35.34/35.25 | 20/39.58 | 80.15/80.07 |
| XLM-V-base | 83.43/83.36 | 58.83/92.23 | 73.26/85.69/68.21/85.56 | 81.1/81.2 | 35.28/35.25 | 8/26.66 | 80.1/79.96 |
| XLM-RoBERTa-base | 83.99/84.07 | 60.38/92.49 | 73.72/86.24/68.16/85.8 | 82.0/81.98 | 32.4/32.37 | 20.0/40.43 | 79.14/78.95 |
| FaBERT | 82.68/82.65 | 63.89/93.01 | 72.57/85.39/67.16/85.31 | 83.69/83.67 | 32.47/32.37 | 27.2/48.42 | 82.34/82.29 |
| mBERT | 78.57/78.66 | 60.31/92.54 | 71.79/84.68/65.89/83.99 | 82.69/82.82 | 33.41/33.09 | 27.2/42.18 | 79.19/79.29 |
| AriaBERT | 80.51/80.51 | 60.98/92.45 | 68.09/81.23/62.12/80.94 | 74.47/74.43 | 30.75/30.94 | 14.4/35.48 | 79.09/78.84 |
*Note: due to randomness in the fine-tuning process, results that differ by less than 1% are considered equivalent.
📄 License
This model is licensed under the Apache-2.0 license.
🔧 Technical Details
The models are pre-trained on Persian data using the MLM (WWM) objective with two context lengths and come in two sizes, base and large, to suit different application requirements.
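
To make the objective concrete: with whole-word masking, when a word is chosen for masking, all of its sub-word tokens are masked together rather than independently. The toy sketch below only illustrates that idea; it is not the actual pre-training code, it assumes the checkpoint ships a fast tokenizer that exposes word_ids(), and the 15% rate is the conventional MLM masking probability rather than a confirmed TookaBERT setting.

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Base")

def whole_word_mask(text, mask_prob=0.15):
    # Map every sub-token back to its source word, pick words at random,
    # and mask all sub-tokens of each picked word together.
    encoding = tokenizer(text)
    word_ids = encoding.word_ids()  # None for special tokens
    chosen = {w for w in set(word_ids) if w is not None and random.random() < mask_prob}
    masked_ids = [
        tokenizer.mask_token_id if word_id in chosen else token_id
        for token_id, word_id in zip(encoding["input_ids"], word_ids)
    ]
    return tokenizer.decode(masked_ids)

# Example sentence: "The city of Berlin is located in Germany."
print(whole_word_mask("شهر برلین در کشور آلمان واقع شده است."))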
📞 Contact us
If you have any questions regarding this model, you can reach us through the model's Community tab on Hugging Face.