# TahrirchiBERT base model

TahrirchiBERT-base is an encoder-only Transformer text model with 110 million parameters. It is pretrained on the Uzbek language (Latin script) using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between "uzbek" and "Uzbek".
For full details of this model, please read our paper (coming soon!) and release blog post.
## Features
This model is part of the TahrirchiBERT family of models, trained at different parameter scales, which will continue to be expanded in the future.
| Property | Details |
|----------|---------|
| Model Type | TahrirchiBERT models family |
| Training Data | Uzbek Crawl and the entire Latin-script portion of Uzbek Books, comprising roughly 4,000 preprocessed books and 1.2 million curated text documents scraped from the internet and Telegram blogs (about 5 billion tokens in total). |
## Quick Start
This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.
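As a sketch of that fine-tuning setup, the checkpoint can be loaded with a freshly initialized sequence-classification head via the standard `transformers` Auto classes. Note that `num_labels=2` and the sample sentence are illustrative assumptions for a hypothetical binary task, not part of the released model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "tahrirchi/tahrirchi-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is a placeholder for a hypothetical binary task;
# the classification head is freshly initialized and needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Bu film juda yaxshi ekan.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # one row of scores, one column per class
```

The logits are random until the head is trained, so this snippet only demonstrates wiring; pass the model and tokenizer to `Trainer` (or a manual training loop) to fine-tune.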
## Usage Examples
### Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tahrirchi/tahrirchi-bert-base')
>>> unmasker("Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning <mask>, mutafakkiri va davlat arbobi boʻlgan.")
[{'score': 0.4616584777832031,
  'token': 10879,
  'token_str': ' shoiri',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning shoiri, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.19899587333202362,
  'token': 10013,
  'token_str': ' olimi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning olimi, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.055418431758880615,
  'token': 12224,
  'token_str': ' asoschisi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning asoschisi, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.037673842161893845,
  'token': 24597,
  'token_str': ' faylasufi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning faylasufi, mutafakkiri va davlat arbobi boʻlgan.'},
 {'score': 0.029616089537739754,
  'token': 9543,
  'token_str': ' farzandi',
  'sequence': 'Alisher Navoiy — ulugʻ oʻzbek va boshqa turkiy xalqlarning farzandi, mutafakkiri va davlat arbobi boʻlgan.'}]

>>> unmasker("Egiluvchan boʻgʻinlari va <mask>, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.")
[{'score': 0.1740381121635437,
  'token': 12571,
  'token_str': ' oyoqlari',
  'sequence': 'Egiluvchan boʻgʻinlari va oyoqlari, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.05455964431166649,
  'token': 2073,
  'token_str': ' uzun',
  'sequence': 'Egiluvchan boʻgʻinlari va uzun, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.050441522151231766,
  'token': 19725,
  'token_str': ' barmoqlari',
  'sequence': 'Egiluvchan boʻgʻinlari va barmoqlari, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.04490342736244202,
  'token': 10424,
  'token_str': ' tanasi',
  'sequence': 'Egiluvchan boʻgʻinlari va tanasi, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.03777358680963516,
  'token': 27116,
  'token_str': ' bukilgan',
  'sequence': 'Egiluvchan boʻgʻinlari va bukilgan, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'}]
```
## Technical Details
### Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,528, chosen to represent rare words well. The model's inputs are sequences of 512 contiguous tokens that may span document boundaries. In addition, a number of regular expressions were applied to avoid misrepresenting symbols that are often used incorrectly in practice.
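Assuming the tokenizer is published alongside the checkpoint on the Hub (as is standard for `transformers` models), the tokenization described above can be inspected directly; the sample sentence is illustrative:

```python
from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("tahrirchi/tahrirchi-bert-base")

text = "Alisher Navoiy buyuk shoir edi."
# Truncate to the 512-token limit used during pretraining.
encoded = tokenizer(text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```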
### Pretraining
The model was trained for one million steps with a batch size of 512. The sequence length was limited to 512 tokens during the entire pretraining stage. The optimizer used was Adam with a learning rate of 5e-4, \(\beta_{1} = 0.9\), \(\beta_{2} = 0.98\), and a weight decay of 1e-5. The learning rate was warmed up to its full value over the first 6% of the training duration, then decayed linearly to 0.02x the full value by the end of training.
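The learning-rate schedule described above can be sketched as a plain function. The step counts and rates follow the numbers in this section; the function name and parameterization are ours, not from the training code:

```python
def lr_at(step, total_steps=1_000_000, peak_lr=5e-4,
          warmup_frac=0.06, final_frac=0.02):
    """Linear warmup to peak_lr, then linear decay to final_frac * peak_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: scale linearly from peak_lr down to final_frac * peak_lr
    # at the final training step.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - (1.0 - final_frac) * progress)

print(lr_at(60_000))     # end of warmup: 5e-4
print(lr_at(1_000_000))  # end of training: 0.02 * 5e-4 = 1e-5
```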
## License
This model is licensed under the Apache-2.0 license.
## Documentation
Please cite this model using the following format:
```bibtex
@online{Mamasaidov2023TahrirchiBERT,
    author  = {Mukhammadsaid Mamasaidov and Abror Shopulatov},
    title   = {TahrirchiBERT base},
    year    = {2023},
    url     = {https://huggingface.co/tahrirchi/tahrirchi-bert-base},
    note    = {Accessed: 2023-10-27}, % change this date
    urldate = {2023-10-27} % change this date
}
```
## Acknowledgments
We are thankful to these awesome organizations and people for their help in making this happen: