🚀 TookaBERT Model
TookaBERT is a family of encoder models for Persian language processing, available in two sizes, base and large, and pre-trained on over 500GB of diverse Persian data.
🚀 Quick Start
Model Details
TookaBERT models are encoder models trained on Persian in two sizes: base and large. They were pre-trained on over 500GB of Persian data spanning a wide variety of topics such as news, blogs, forums, and books, using the masked language modeling objective with whole-word masking (MLM-WWM) and two context lengths.
For more information, you can read our paper on arXiv.
✨ Features
- Trained on a large corpus of Persian data (over 500GB).
- Available in two sizes: base and large.
- Pre-trained with the MLM (WWM) objective.
💻 Usage Examples
Basic Usage
You can use this model directly for masked language modeling with the code below.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Base")
model = AutoModelForMaskedLM.from_pretrained("PartAI/TookaBERT-Base")

# "The city of Berlin is located in <mask>."
text = "شهر برلین در کشور <mask> واقع شده است."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
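
The forward pass above returns raw logits rather than a filled-in sentence. Continuing that example, the following minimal sketch decodes the highest-scoring candidate at the masked position (it assumes only the tokenizer's standard mask_token_id attribute):

# Locate the <mask> position and decode the top-scoring prediction.
mask_positions = (encoded_input["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = output.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))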
Advanced Usage
It is also possible to use an inference pipeline, as shown below.
from transformers import pipeline

# The fill-mask pipeline handles tokenization, the forward pass, and decoding.
inference_pipeline = pipeline('fill-mask', model="PartAI/TookaBERT-Base")
inference_pipeline("شهر برلین در کشور <mask> واقع شده است.")
You can also fine-tune this model on your own dataset to adapt it to a downstream task, for example on datasets such as the following (a minimal fine-tuning sketch follows the list):
- DeepSentiPers (Sentiment Analysis)

- ParsiNLU-Multiple-choice (Multiple-choice)
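
As an illustration only, here is a minimal fine-tuning sketch for a text-classification task using the Hugging Face Trainer. The CSV file names, the "text" and "label" column names, the number of labels, and the hyperparameters are placeholders to adapt to your own data; they are not the settings behind the results reported below.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Placeholder dataset: CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Base")
model = AutoModelForSequenceClassification.from_pretrained("PartAI/TookaBERT-Base", num_labels=3)

def tokenize(batch):
    # Truncate long texts to the model's maximum context length.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tookabert-finetuned", learning_rate=2e-5,
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()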

📚 Documentation
Evaluation
TookaBERT models were evaluated on a wide range of downstream NLP tasks, such as Sentiment Analysis (SA), Text Classification, Multiple-choice, Question Answering, and Named Entity Recognition (NER).
Here are some key performance results:
| Model name | DeepSentiPers (f1/acc) | MultiCoNER-v2 (f1/acc) | PQuAD (best_exact/best_f1/HasAns_exact/HasAns_f1) | FarsTail (f1/acc) | ParsiNLU-Multiple-choice (f1/acc) | ParsiNLU-Reading-comprehension (exact/f1) | ParsiNLU-QQP (f1/acc) |
|---|---|---|---|---|---|---|---|
| TookaBERT-large | 85.66/85.78 | 69.69/94.07 | 75.56/88.06/70.24/87.83 | 89.71/89.72 | 36.13/35.97 | 33.6/60.5 | 82.72/82.63 |
| TookaBERT-base | 83.93/83.93 | 66.23/93.3 | 73.18/85.71/68.29/85.94 | 83.26/83.41 | 33.6/33.81 | 20.8/42.52 | 81.33/81.29 |
| Shiraz | 81.17/81.08 | 59.1/92.83 | 65.96/81.25/59.63/81.31 | 77.76/77.75 | 34.73/34.53 | 17.6/39.61 | 79.68/79.51 |
| ParsBERT | 80.22/80.23 | 64.91/93.23 | 71.41/84.21/66.29/84.57 | 80.89/80.94 | 35.34/35.25 | 20/39.58 | 80.15/80.07 |
| XLM-V-base | 83.43/83.36 | 58.83/92.23 | 73.26/85.69/68.21/85.56 | 81.1/81.2 | 35.28/35.25 | 8/26.66 | 80.1/79.96 |
| XLM-RoBERTa-base | 83.99/84.07 | 60.38/92.49 | 73.72/86.24/68.16/85.8 | 82.0/81.98 | 32.4/32.37 | 20.0/40.43 | 79.14/78.95 |
| FaBERT | 82.68/82.65 | 63.89/93.01 | 72.57/85.39/67.16/85.31 | 83.69/83.67 | 32.47/32.37 | 27.2/48.42 | 82.34/82.29 |
| mBERT | 78.57/78.66 | 60.31/92.54 | 71.79/84.68/65.89/83.99 | 82.69/82.82 | 33.41/33.09 | 27.2/42.18 | 79.19/79.29 |
| AriaBERT | 80.51/80.51 | 60.98/92.45 | 68.09/81.23/62.12/80.94 | 74.47/74.43 | 30.75/30.94 | 14.4/35.48 | 79.09/78.84 |
*Note: due to randomness in the fine-tuning process, results that differ by less than 1% are considered equivalent.
📄 License
This model is licensed under the Apache-2.0 license.
🔧 Technical Details
The models are pre-trained on Persian data using the MLM (WWM) objective with two context lengths and come in two sizes, base and large, to suit different application requirements.
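
To make the objective concrete: with whole-word masking, when a word is chosen for masking, all of its sub-word tokens are masked together rather than independently. The toy sketch below only illustrates that idea; it is not the actual pre-training code, it assumes the checkpoint ships a fast tokenizer that exposes word_ids(), and the 15% rate is the conventional MLM masking probability rather than a confirmed TookaBERT setting.

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PartAI/TookaBERT-Base")

def whole_word_mask(text, mask_prob=0.15):
    # Map every sub-token back to its source word, pick words at random,
    # and mask all sub-tokens of each picked word together.
    encoding = tokenizer(text)
    word_ids = encoding.word_ids()  # None for special tokens
    chosen = {w for w in set(word_ids) if w is not None and random.random() < mask_prob}
    masked_ids = [
        tokenizer.mask_token_id if word_id in chosen else token_id
        for token_id, word_id in zip(encoding["input_ids"], word_ids)
    ]
    return tokenizer.decode(masked_ids)

# Example sentence: "The city of Berlin is located in Germany."
print(whole_word_mask("شهر برلین در کشور آلمان واقع شده است."))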
📞 Contact us
If you have any questions regarding this model, you can reach us through the model's Community tab on Hugging Face.