# AlephBERT
State-of-the-art Hebrew language model based on Google's BERT architecture.
## Quick Start
AlephBERT is a state-of-the-art language model designed for Hebrew. It is built upon Google's BERT architecture (Devlin et al. 2018).
```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

alephbert.eval()
```
## Features
- Advanced Architecture: Based on Google's BERT architecture, providing high-performance language understanding for Hebrew.
- Diverse Training Data: Trained on a wide range of Hebrew datasets, including OSCAR, Wikipedia, and Twitter data.
## Installation

No separate installation is required: AlephBERT is loaded directly from the Hugging Face Hub via the `transformers` library (for example, `pip install transformers`, together with PyTorch).
## Usage Examples

### Basic Usage
```python
from transformers import BertModel, BertTokenizerFast

# Load the pretrained tokenizer and encoder from the Hugging Face Hub.
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# If not fine-tuning, switch to evaluation mode (disables dropout).
alephbert.eval()
```
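Building on the block above, here is a minimal sketch of running a forward pass; the example sentence and variable names are illustrative and not part of the original model card:

```python
import torch

# Illustrative Hebrew input ("hello world"); any short Hebrew text works here.
sentence = 'שלום עולם'

# Tokenize and run a forward pass without tracking gradients.
inputs = alephbert_tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    outputs = alephbert(**inputs)

# One contextual vector per token: shape (batch_size, sequence_length, hidden_size),
# i.e. (1, number_of_tokens, 768) for the base model.
print(outputs.last_hidden_state.shape)
```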
## Documentation

### Training data
- OSCAR (Ortiz, 2019) Hebrew section (10 GB text, 20 million sentences).
- Hebrew dump of Wikipedia (650 MB text, 3 million sentences).
- Hebrew Tweets collected from the Twitter sample stream (7 GB text, 70 million sentences).
### Training procedure
The model was trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.
Since the larger part of our training data consists of tweets, we decided to start by optimizing with the Masked Language Model (MLM) loss only.
To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence:
- num tokens < 32 (70M sentences)
- 32 <= num tokens < 64 (12M sentences)
- 64 <= num tokens < 128 (10M sentences)
- 128 <= num tokens < 512 (1.5M sentences)
Each section was first trained for 5 epochs with an initial learning rate of 1e-4, and then for another 5 epochs with an initial learning rate of 1e-5, for a total of 10 epochs.
Total training time was 8 days.
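The length-based split described above can be illustrated with a short, hypothetical helper (this is a sketch for clarity, not the original preprocessing code; the function name and the handling of over-long sentences are assumptions):

```python
# Hypothetical helper illustrating the length-based bucketing described above.
# Bucket boundaries follow the section limits: <32, <64, <128, <512 tokens.
def bucket_by_token_count(sentences, tokenizer, boundaries=(32, 64, 128, 512)):
    buckets = {limit: [] for limit in boundaries}
    for sentence in sentences:
        num_tokens = len(tokenizer.tokenize(sentence))
        for limit in boundaries:
            if num_tokens < limit:
                buckets[limit].append(sentence)
                break
        # Sentences with 512 or more tokens are simply skipped in this sketch.
    return buckets
```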
## Technical Details

The model is based on Google's BERT architecture and was trained on a DGX machine with 8 V100 GPUs using the standard Hugging Face training procedure. To optimize training time, the data was split into sections by sentence length (number of tokens).
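For orientation only, below is a minimal sketch of a standard Hugging Face masked-language-modeling setup; the tiny corpus, output path, batch size, and other values are placeholders rather than the original training configuration (which used 8 V100 GPUs, length-based sections, 5 + 5 epochs, and learning rates of 1e-4 then 1e-5):

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
model = BertForMaskedLM.from_pretrained('onlplab/alephbert-base')

# Placeholder corpus; in practice this would be the tokenized OSCAR/Wikipedia/Twitter data.
texts = ['שלום עולם', 'זה משפט לדוגמה']  # "hello world", "this is an example sentence"
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# Randomly mask 15% of the input tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='./alephbert-mlm',        # hypothetical output path
    num_train_epochs=5,
    learning_rate=1e-4,
    per_device_train_batch_size=32,      # placeholder value
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```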
## License

This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Hebrew Language Model |
| Training Data | OSCAR Hebrew section, Hebrew Wikipedia dump, Hebrew Tweets |