# TahrirchiBERT base model

TahrirchiBERT-base is an encoder-only Transformer text model with 110 million parameters. It is pretrained on the Uzbek language (Latin script) using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between "uzbek" and "Uzbek".
For full details of this model, please read our paper (coming soon!) and release blog post.
## Features
This model is part of the TahrirchiBERT family of models, trained at different parameter scales, which will continue to be expanded in the future.
| Property | Details |
|----------|---------|
| Model Type | TahrirchiBERT models family |
| Training Data | Uzbek Crawl and the entire Latin-script portion of Uzbek Books, comprising roughly 4,000 preprocessed books and 1.2 million curated text documents scraped from the internet and Telegram blogs (about 5 billion tokens in total). |
## Quick Start
This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.
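As a sketch of that fine-tuning setup, the checkpoint can be loaded with a freshly initialized sequence-classification head via the standard `transformers` Auto classes. Note that `num_labels=2` and the sample sentence are illustrative assumptions for a hypothetical binary task, not part of the released model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "tahrirchi/tahrirchi-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is a placeholder for a hypothetical binary task;
# the classification head is freshly initialized and needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Bu film juda yaxshi ekan.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # one row of scores, one column per class
```

The logits are random until the head is trained, so this snippet only demonstrates wiring; pass the model and tokenizer to `Trainer` (or a manual training loop) to fine-tune.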
## Usage Examples
### Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tahrirchi/tahrirchi-bert-base')
>>> unmasker("Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning <mask>, mutafakkiri va davlat arbobi boʻlgan.")
[{'score': 0.4616584777832031,
  'token': 10879,
  'token_str': ' shoiri',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning shoiri, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.19899587333202362,
  'token': 10013,
  'token_str': ' olimi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning olimi, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.055418431758880615,
  'token': 12224,
  'token_str': ' asoschisi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning asoschisi, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.037673842161893845,
  'token': 24597,
  'token_str': ' faylasufi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning faylasufi, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.029616089537739754,
  'token': 9543,
  'token_str': ' farzandi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning farzandi, mutafakkiri va davlat arbobi boʻlgan.'}]

>>> unmasker("Egiluvchan boʻgʻinlari va <mask>, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.")
[{'score': 0.1740381121635437,
  'token': 12571,
  'token_str': ' oyoqlari',
  'sequence': 'Egiluvchan boʻgʻinlari va oyoqlari, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.05455964431166649,
  'token': 2073,
  'token_str': ' uzun',
  'sequence': 'Egiluvchan boʻgʻinlari va uzun, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.050441522151231766,
  'token': 19725,
  'token_str': ' barmoqlari',
  'sequence': 'Egiluvchan boʻgʻinlari va barmoqlari, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.04490342736244202,
  'token': 10424,
  'token_str': ' tanasi',
  'sequence': 'Egiluvchan boʻgʻinlari va tanasi, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.03777358680963516,
  'token': 27116,
  'token_str': ' bukilgan',
  'sequence': 'Egiluvchan boʻgʻinlari va bukilgan, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'}]
```
## Technical Details
### Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,528, chosen to represent rare words well. The model's inputs are sequences of 512 contiguous tokens that may span document boundaries. In addition, a number of regular expressions were applied to avoid misrepresenting symbols that are often used incorrectly in practice.
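Assuming the tokenizer is published alongside the checkpoint on the Hub (as is standard for `transformers` models), the tokenization described above can be inspected directly; the sample sentence is illustrative:

```python
from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("tahrirchi/tahrirchi-bert-base")

text = "Alisher Navoiy buyuk shoir edi."
# Truncate to the 512-token limit used during pretraining.
encoded = tokenizer(text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```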
### Pretraining
The model was trained for one million steps with a batch size of 512. The sequence length was limited to 512 tokens during the entire pretraining stage. The optimizer used was Adam with a learning rate of 5e-4, \(\beta_{1} = 0.9\), \(\beta_{2} = 0.98\), and a weight decay of 1e-5. The learning rate was warmed up to its full value over the first 6% of the training duration, then decayed linearly to 0.02x the full value by the end of training.
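The learning-rate schedule described above can be sketched as a plain function. The step counts and rates follow the numbers in this section; the function name and parameterization are ours, not from the training code:

```python
def lr_at(step, total_steps=1_000_000, peak_lr=5e-4,
          warmup_frac=0.06, final_frac=0.02):
    """Linear warmup to peak_lr, then linear decay to final_frac * peak_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: scale linearly from peak_lr down to final_frac * peak_lr
    # at the final training step.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - (1.0 - final_frac) * progress)

print(lr_at(60_000))     # end of warmup: 5e-4
print(lr_at(1_000_000))  # end of training: 0.02 * 5e-4 = 1e-5
```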
## License
This model is licensed under the Apache-2.0 license.
## Documentation
Please cite this model using the following format:
```bibtex
@online{Mamasaidov2023TahrirchiBERT,
    author  = {Mukhammadsaid Mamasaidov and Abror Shopulatov},
    title   = {TahrirchiBERT base},
    year    = {2023},
    url     = {https://huggingface.co/tahrirchi/tahrirchi-bert-base},
    note    = {Accessed: 2023-10-27}, % change this date
    urldate = {2023-10-27} % change this date
}
```
## Acknowledgments
We are thankful to these awesome organizations and people for their help in making this happen: