# AlephBERT
State-of-the-art language model for Hebrew, based on Google's BERT architecture.
## Quick Start
AlephBERT is a state-of-the-art language model for Hebrew, built on Google's BERT architecture. It offers high-performance language processing capabilities for Hebrew text.
```python
from transformers import BertModel, BertTokenizerFast

# Load the AlephBERT tokenizer and model from the Hugging Face Hub.
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# Put the model in evaluation mode for inference.
alephbert.eval()
```
## Features
- Advanced Architecture: Based on Google's BERT architecture, it provides state-of-the-art performance for Hebrew language processing.
- Diverse Training Data: Trained on a wide range of Hebrew data sources, including OSCAR, Wikipedia, and Twitter, ensuring broad language coverage.
## Installation
The original README does not list dedicated installation steps. AlephBERT is loaded through the Hugging Face `transformers` library, so installing `transformers` together with a backend such as PyTorch is sufficient.
## Usage Examples
### Basic Usage
```python
from transformers import BertModel, BertTokenizerFast

# Load the AlephBERT tokenizer and model.
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# Evaluation mode for inference.
alephbert.eval()
```
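Continuing from the snippet above, the tokenizer and model can be applied to a Hebrew sentence to obtain contextual embeddings. This example is not part of the original README; the sentence is a hypothetical illustration.

```python
import torch

# Encode a Hebrew sentence (hypothetical example) and run it through AlephBERT.
sentence = "עברית היא שפה יפה."  # "Hebrew is a beautiful language."
inputs = alephbert_tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = alephbert(**inputs)

# One contextual vector per input token: shape (1, seq_len, hidden_size).
token_embeddings = outputs.last_hidden_state
# A simple sentence-level representation taken from the [CLS] position.
sentence_embedding = token_embeddings[:, 0, :]
print(token_embeddings.shape, sentence_embedding.shape)
```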
### Advanced Usage
The original README does not include advanced usage examples; one possible direction is sketched below.
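Because AlephBERT was optimized with a masked language modeling objective (see Training Procedure below), one natural extension is masked-token prediction. The sketch below is not from the original README: it assumes the same `onlplab/alephbert-base` checkpoint works with the standard `BertForMaskedLM` head from `transformers`, and the example sentence is hypothetical.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
mlm_model = BertForMaskedLM.from_pretrained('onlplab/alephbert-base')
mlm_model.eval()

# Hypothetical Hebrew sentence with one position masked out.
text = f"ירושלים היא {tokenizer.mask_token} של ישראל."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = mlm_model(**inputs).logits

# Locate the masked position and list its top-5 predicted tokens.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```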
## Documentation
### Training Data
| Property | Details |
|----------|---------|
| Training Data | 1. OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/) Hebrew section (10 GB of text, 20 million sentences). 2. Hebrew dump of Wikipedia (650 MB of text, 3 million sentences). 3. Hebrew Tweets collected from the Twitter sample stream (7 GB of text, 70 million sentences). |
### Training Procedure
AlephBERT was trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure. Since the larger part of the training data is based on tweets, the model was initially optimized using only the masked language modeling loss.
To reduce training time, the data was split into 4 sections based on the number of tokens per sentence:
- num tokens < 32 (70M sentences)
- 32 <= num tokens < 64 (12M sentences)
- 64 <= num tokens < 128 (10M sentences)
- 128 <= num tokens < 512 (1.5M sentences)
Each section was first trained for 5 epochs with an initial learning rate of 1e-4, and then for another 5 epochs with an initial learning rate of 1e-5, for a total of 10 epochs. The total training time was 8 days.
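The length-based split can be illustrated with a short sketch. This is not the original training code; it assumes the corpus is already available as a Python list of sentences and simply groups them by tokenized length using the boundaries listed above (shorter buckets can then be padded to a smaller maximum length, which is where the time savings come from).

```python
from collections import defaultdict
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')

# Token-length boundaries taken from the list above: [lower, upper) in tokens.
BUCKETS = [(0, 32), (32, 64), (64, 128), (128, 512)]

def bucket_sentences(sentences):
    """Group sentences into the four length-based sections described above."""
    buckets = defaultdict(list)
    for sentence in sentences:
        n_tokens = len(tokenizer.tokenize(sentence))
        for lower, upper in BUCKETS:
            if lower <= n_tokens < upper:
                buckets[(lower, upper)].append(sentence)
                break
    return buckets

# Hypothetical usage: buckets = bucket_sentences(corpus_sentences)
```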
## Technical Details
The model was trained on a DGX machine with 8 V100 GPUs using the standard Hugging Face training procedure. Splitting the data by number of tokens reduces training time, since each section can be padded to a shorter maximum length. Training each section first with a relatively high learning rate and then with a lower one helps the model converge more effectively; a sketch of this two-phase schedule follows.
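A minimal sketch of that two-phase schedule, using the Hugging Face `Trainer` with a masked language modeling collator. This is not the original training script: it assumes a pre-tokenized `train_dataset` already exists for one length section, and the output directories and batch size are placeholders.

```python
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
# Fresh BERT-base-style model for pretraining from scratch (sketch only).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Two phases of 5 epochs each, mirroring the schedule described above.
for phase_lr in (1e-4, 1e-5):
    args = TrainingArguments(
        output_dir=f'alephbert-mlm-lr-{phase_lr}',   # placeholder output path
        learning_rate=phase_lr,
        num_train_epochs=5,
        per_device_train_batch_size=32,              # placeholder batch size
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,  # assumed: a pre-tokenized dataset of one length section
        data_collator=collator,
    )
    trainer.train()
```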
## License
The model is licensed under the Apache-2.0 license.