🚀 TookaBERT Model
TookaBERT models are encoder models designed for Persian language processing. They come in two sizes, base and large, and are pre-trained on over 500GB of Persian data covering diverse topics such as news, blogs, forums, and books. The models are pre-trained with the MLM (WWM) objective using two context lengths. Notably, TookaBERT-Large is the first large encoder model pre-trained on Persian and currently stands as the state-of-the-art model for Persian tasks.
Model Metadata
| Property | Details |
|---|---|
| License | Apache-2.0 |
| Language | Persian (fa) |
| Pipeline Tag | fill-mask |
| Mask Token | `<mask>` |
Widget Examples
- "توانا بود هر که بود ز دانش دل پیر برنا بود"
- "شهر برلین در کشور واقع شده است."
- "بهنام از خوانندگان مشهور کشور ما است."
- "رضا از بازیگران مشهور کشور ما است."
- "سید ابراهیم رییسی در سال رییس جمهور ایران شد."
- "دیگر امکان ادامه وجود ندارد. باید قرارداد را کنیم."
🚀 Quick Start
✨ Features
- Two Sizes: Available in base and large sizes to suit different needs.
- Extensive Training Data: Pre-trained on over 500GB of diverse Persian data.
- State-of-the-Art: TookaBERT-Large is the leading model for Persian tasks.
📦 Installation
No specific installation steps are provided; the usage examples below rely on the Hugging Face `transformers` library.
💻 Usage Examples
Basic Usage
You can use this model directly for masked language modeling with the code below.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Large")
model = AutoModelForMaskedLM.from_pretrained("PartAI/TookaBERT-Large")

text = "شهر برلین در کشور <mask> واقع شده است."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
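The forward pass above returns raw logits rather than decoded words. The following is a minimal sketch of how you might read off the top candidates for the masked position, using standard `torch`/`transformers` calls; it is illustrative and not part of the original card.

```python
import torch

# Find the position of the <mask> token and take the 5 highest-scoring
# candidate tokens for it from the model's output logits.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = torch.topk(output.logits[0, mask_positions[0]], k=5).indices
print([tokenizer.decode([token_id]) for token_id in top_ids.tolist()])
```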
Advanced Usage
You can also run inference through a pipeline, as shown below.
```python
from transformers import pipeline

inference_pipeline = pipeline('fill-mask', model="PartAI/TookaBERT-Large")
inference_pipeline("شهر برلین در کشور <mask> واقع شده است.")
```
You can also fine-tune this model on your own dataset to prepare it for a downstream task, for example the tasks below (a fine-tuning sketch follows this list).
- DeepSentiPers (Sentiment Analysis)

- ParsiNLU-Multiple-choice (Multiple-choice)

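As a minimal fine-tuning sketch for a sentence-level task such as sentiment analysis, one possible setup with the `transformers` `Trainer` is shown below. The CSV file names, column names (`text`, `label`), and hyperparameters are illustrative assumptions, not taken from the original card.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_name = "PartAI/TookaBERT-Large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels depends on your task, e.g. 3 classes for sentiment analysis.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical CSV files with "text" and "label" columns; replace with your own data.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tookabert-finetuned",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```

For multiple-choice tasks such as ParsiNLU-Multiple-choice, `AutoModelForMultipleChoice` can be used in place of the sequence-classification head.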
📚 Documentation
For more information, you can read our paper on arXiv.
🔧 Technical Details
TookaBERT models are pre-trained on Persian data with the MLM (WWM) objective using two context lengths. The large version, TookaBERT-Large, is a significant milestone as the first large encoder model pre-trained on Persian.
📄 License
The model is licensed under the Apache-2.0 license.
📊 Evaluation
TookaBERT models are evaluated on a wide range of downstream NLP tasks, such as Sentiment Analysis (SA), Text Classification, Multiple-choice, Question Answering, and Named Entity Recognition (NER).
Here are some key performance results:
| Model name | DeepSentiPers (f1/acc) | MultiCoNER-v2 (f1/acc) | PQuAD (best_exact/best_f1/HasAns_exact/HasAns_f1) | FarsTail (f1/acc) | ParsiNLU-Multiple-choice (f1/acc) | ParsiNLU-Reading-comprehension (exact/f1) | ParsiNLU-QQP (f1/acc) |
|---|---|---|---|---|---|---|---|
| TookaBERT-large | 85.66/85.78 | 69.69/94.07 | 75.56/88.06/70.24/87.83 | 89.71/89.72 | 36.13/35.97 | 33.6/60.5 | 82.72/82.63 |
| TookaBERT-base | 83.93/83.93 | 66.23/93.3 | 73.18/85.71/68.29/85.94 | 83.26/83.41 | 33.6/33.81 | 20.8/42.52 | 81.33/81.29 |
| Shiraz | 81.17/81.08 | 59.1/92.83 | 65.96/81.25/59.63/81.31 | 77.76/77.75 | 34.73/34.53 | 17.6/39.61 | 79.68/79.51 |
| ParsBERT | 80.22/80.23 | 64.91/93.23 | 71.41/84.21/66.29/84.57 | 80.89/80.94 | 35.34/35.25 | 20/39.58 | 80.15/80.07 |
| XLM-V-base | 83.43/83.36 | 58.83/92.23 | 73.26/85.69/68.21/85.56 | 81.1/81.2 | 35.28/35.25 | 8/26.66 | 80.1/79.96 |
| XLM-RoBERTa-base | 83.99/84.07 | 60.38/92.49 | 73.72/86.24/68.16/85.8 | 82.0/81.98 | 32.4/32.37 | 20.0/40.43 | 79.14/78.95 |
| FaBERT | 82.68/82.65 | 63.89/93.01 | 72.57/85.39/67.16/85.31 | 83.69/83.67 | 32.47/32.37 | 27.2/48.42 | 82.34/82.29 |
| mBERT | 78.57/78.66 | 60.31/92.54 | 71.79/84.68/65.89/83.99 | 82.69/82.82 | 33.41/33.09 | 27.2/42.18 | 79.19/79.29 |
| AriaBERT | 80.51/80.51 | 60.98/92.45 | 68.09/81.23/62.12/80.94 | 74.47/74.43 | 30.75/30.94 | 14.4/35.48 | 79.09/78.84 |
*Note: due to randomness in the fine-tuning process, results that differ by less than 1% are treated as equivalent.
📞 Contact us
If you have any questions about this model, you can reach us through the model's Community tab on Hugging Face.