🚀 KyrgyzBert
KyrgyzBert is a small-scale BERT-based language model pre-trained on a large Kyrgyz text corpus. It is designed for masked language modeling, text classification, and other Kyrgyz NLP applications, and aims to support both Kyrgyz NLP research and practical use.
🚀 Quick Start
You can load the model with Hugging Face's `transformers` library:
```python
from transformers import BertTokenizerFast, BertForMaskedLM
import torch

model_name = "metinovadilet/KyrgyzBert"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

text = "Бул жерден [MASK] нерселерди таба аласыз."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

# Locate the [MASK] position and take the 5 most likely replacement tokens.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1].item()
probs = torch.softmax(outputs[0, masked_index], dim=-1)
top_k = torch.topk(probs, k=5)

predicted_tokens = [tokenizer.decode([token_id]) for token_id in top_k.indices.tolist()]
print(f"Top predictions for [MASK]: {', '.join(predicted_tokens)}")
```
✨ Features
- Small-scale BERT: KyrgyzBert is a compact BERT variant, making it practical for a wide range of Kyrgyz NLP tasks.
- Multiple applications: It can be used for masked language modeling, text classification, text completion, feature extraction, and fine-tuning on specific Kyrgyz NLP tasks (see the embedding sketch below).
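As referenced in the feature-extraction bullet above, the encoder's hidden states can serve as Kyrgyz word or sentence embeddings. The sketch below uses the plain `BertModel` encoder with mean pooling; the pooling strategy and example sentences are illustrative choices, not something prescribed by the model card.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/KyrgyzBert")
encoder = BertModel.from_pretrained("metinovadilet/KyrgyzBert")  # encoder only, no MLM head

sentences = ["Бишкек - Кыргызстандын борбору.", "Мен китеп окуганды жакшы көрөм."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden_states = encoder(**inputs).last_hidden_state  # shape: (batch, seq_len, 512)

# Mean-pool over real (non-padding) tokens to get one 512-dimensional vector per sentence.
mask = inputs.attention_mask.unsqueeze(-1)
sentence_embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 512])
```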
📦 Installation
The model is loaded through Hugging Face's `transformers` library. If the library is not already installed, install it with:

```bash
pip install transformers
```
📚 Documentation
Model Details

| Property | Details |
|----------|---------|
| Model Type | BERT (small-scale variant) |
| Vocabulary Size | Custom Kyrgyz tokenizer |
| Hidden Size | 512 |
| Number of Layers | 6 |
| Attention Heads | 8 |
| Intermediate Size | 2048 |
| Max Sequence Length | 512 |
| Pretraining Task | Masked Language Modeling (MLM) |
| Framework | Hugging Face Transformers |
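The architecture values in the table can be checked directly against the published checkpoint's configuration; this short sketch just loads the config and prints the relevant fields.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("metinovadilet/KyrgyzBert")

# These fields should match the table above (512 / 6 / 8 / 2048 / 512).
print("hidden_size:", config.hidden_size)
print("num_hidden_layers:", config.num_hidden_layers)
print("num_attention_heads:", config.num_attention_heads)
print("intermediate_size:", config.intermediate_size)
print("max_position_embeddings:", config.max_position_embeddings)
print("vocab_size:", config.vocab_size)
```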
Training Data
This model was trained on a non-disclosable dataset containing over 1.5 million sentences. The dataset was tokenized with the metinovadilet/bert-kyrgyz-tokenizer.
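If you want to reproduce the preprocessing step, the same tokenizer is available on the Hub and can be applied to raw Kyrgyz text as follows; this is a minimal sketch, and the actual corpus-preparation pipeline is not disclosed.

```python
from transformers import AutoTokenizer

# Tokenizer used to build the pretraining corpus (per the description above).
tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

encoding = tokenizer("Кыргыз тили - түрк тилдеринин бири.", truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```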
Training Setup
- Hardware: Trained on an RTX 3090 GPU
- Batch Size: 16
- Optimizer: AdamW
- Learning Rate: 1e-4
- Weight Decay: 0.01
- Training Epochs: 1000
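The hyperparameters above map onto a standard PyTorch masked-language-modeling loop. The sketch below is an illustrative reconstruction, not the original training script: the tiny stand-in corpus, single epoch, and masking probability are assumptions, with only the optimizer, learning rate, weight decay, and batch size taken from the list above.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
model = BertForMaskedLM.from_pretrained("metinovadilet/KyrgyzBert")  # or a freshly initialized model

# Stand-in corpus; the real (non-disclosable) dataset had over 1.5 million sentences.
sentences = ["Кыргызстан - тоолуу өлкө.", "Мен бүгүн китеп окудум."]
encodings = tokenizer(sentences, truncation=True, max_length=512)
examples = [{"input_ids": ids} for ids in encodings["input_ids"]]

# Randomly masks 15% of tokens (the standard BERT MLM recipe) and builds MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
loader = DataLoader(examples, batch_size=16, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

model.train()
for epoch in range(1):  # the card reports 1000 epochs; one keeps the sketch cheap
    for batch in loader:
        outputs = model(**batch)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```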
Intended Use
- Text Completion & Prediction: Filling in missing words in Kyrgyz text.
- Feature Extraction: Kyrgyz word embeddings for downstream NLP tasks.
- Fine-Tuning: Can be fine-tuned for Kyrgyz sentiment analysis, named entity recognition (NER), machine translation, and other tasks (see the sketch below).
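For the fine-tuning use case, the checkpoint can be loaded with a task head on top. Below is a minimal sentiment-classification sketch; the two inline examples, the binary label set, and the learning rate are placeholder assumptions, not a published Kyrgyz benchmark or recipe.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/KyrgyzBert")
# Adds a randomly initialized classification head (2 labels here: negative/positive).
model = BertForSequenceClassification.from_pretrained("metinovadilet/KyrgyzBert", num_labels=2)

# Placeholder labelled examples; a real task would use an annotated Kyrgyz dataset.
texts = ["Бул фильм абдан жакшы экен!", "Тамак такыр жаккан жок."]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few illustrative gradient steps
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {outputs.loss.item():.4f}")
```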
Limitations
- The model may struggle with low-resource dialects and code-switching.
- Performance depends on the quality and diversity of training data.
- It is not fine-tuned for specific tasks like sentiment analysis or NER.
🔧 Technical Details
KyrgyzBert is pre-trained on a large Kyrgyz text corpus with the masked language modeling (MLM) objective. Its small-scale BERT architecture (hidden size 512, 6 layers, 8 attention heads, intermediate size 2048) keeps the model lightweight for Kyrgyz NLP tasks, and the custom Kyrgyz tokenizer used on the training data helps the model capture the structure of the Kyrgyz language.
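For reference, a randomly initialized model with the architecture described above can be constructed directly from a `BertConfig`. The hyperparameter values below are taken from the Model Details table; reading the vocabulary size from the published tokenizer is an assumption made for the sketch.

```python
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Hyperparameters from the Model Details table above.
config = BertConfig(
    vocab_size=len(tokenizer),
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)

model = BertForMaskedLM(config)  # randomly initialized, ready for MLM pretraining
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```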
📄 License
This model is released under the Apache 2.0 License.
Acknowledgments
This model was developed by Metinov Adilet. If you use this model, please consider citing our work.
Citation
If you use this model in your research, please cite:
@misc{metinovadilet2025kyrgyzbert,
author = {Metinov Adilet},
title = {KyrgyzBert: A Small BERT Model for the Kyrgyz Language},
year = {2025},
howpublished = {Hugging Face},
url = {https://huggingface.co/metinovadilet/KyrgyzBert}
}
Contact
For questions, reach out to Metinov Adilet via Hugging Face or by email: metinovadilet@gmail.com.
Developed in collaboration with Ulutsoft.