# AlephBERT
State-of-the-art language model for Hebrew, based on Google's BERT architecture.
## Quick Start
AlephBERT is a state-of-the-art language model for Hebrew, built on Google's BERT architecture. It offers high-performance language processing capabilities for Hebrew text.
```python
from transformers import BertModel, BertTokenizerFast

# Load the AlephBERT tokenizer and model from the Hugging Face Hub.
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# Put the model in evaluation mode for inference.
alephbert.eval()
```
## Features
- Advanced Architecture: Based on Google's BERT architecture, it provides state-of-the-art performance for Hebrew language processing.
- Diverse Training Data: Trained on a wide range of Hebrew data sources, including OSCAR, Wikipedia, and Twitter, ensuring broad language coverage.
## Installation
The original README does not list dedicated installation steps. AlephBERT is loaded through the Hugging Face `transformers` library, so installing `transformers` together with a backend such as PyTorch is sufficient.
## Usage Examples
### Basic Usage
```python
from transformers import BertModel, BertTokenizerFast

# Load the AlephBERT tokenizer and model.
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# Evaluation mode for inference.
alephbert.eval()
```
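Continuing from the snippet above, the tokenizer and model can be applied to a Hebrew sentence to obtain contextual embeddings. This example is not part of the original README; the sentence is a hypothetical illustration.

```python
import torch

# Encode a Hebrew sentence (hypothetical example) and run it through AlephBERT.
sentence = "עברית היא שפה יפה."  # "Hebrew is a beautiful language."
inputs = alephbert_tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = alephbert(**inputs)

# One contextual vector per input token: shape (1, seq_len, hidden_size).
token_embeddings = outputs.last_hidden_state
# A simple sentence-level representation taken from the [CLS] position.
sentence_embedding = token_embeddings[:, 0, :]
print(token_embeddings.shape, sentence_embedding.shape)
```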
### Advanced Usage
The original README does not include advanced usage examples; one possible direction is sketched below.
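Because AlephBERT was optimized with a masked language modeling objective (see Training Procedure below), one natural extension is masked-token prediction. The sketch below is not from the original README: it assumes the same `onlplab/alephbert-base` checkpoint works with the standard `BertForMaskedLM` head from `transformers`, and the example sentence is hypothetical.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
mlm_model = BertForMaskedLM.from_pretrained('onlplab/alephbert-base')
mlm_model.eval()

# Hypothetical Hebrew sentence with one position masked out.
text = f"ירושלים היא {tokenizer.mask_token} של ישראל."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = mlm_model(**inputs).logits

# Locate the masked position and list its top-5 predicted tokens.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```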
## Documentation
### Training Data
| Property | Details |
|----------|---------|
| Training Data | 1. OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/) Hebrew section (10 GB of text, 20 million sentences). 2. Hebrew dump of Wikipedia (650 MB of text, 3 million sentences). 3. Hebrew Tweets collected from the Twitter sample stream (7 GB of text, 70 million sentences). |
### Training Procedure
AlephBERT was trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure. Since the larger part of the training data is based on tweets, the model was initially optimized using only the masked language modeling loss.
To reduce training time, the data was split into 4 sections based on the number of tokens per sentence:
- num tokens < 32 (70M sentences)
- 32 <= num tokens < 64 (12M sentences)
- 64 <= num tokens < 128 (10M sentences)
- 128 <= num tokens < 512 (1.5M sentences)
Each section was first trained for 5 epochs with an initial learning rate of 1e-4, and then for another 5 epochs with an initial learning rate of 1e-5, for a total of 10 epochs. The total training time was 8 days.
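The length-based split can be illustrated with a short sketch. This is not the original training code; it assumes the corpus is already available as a Python list of sentences and simply groups them by tokenized length using the boundaries listed above (shorter buckets can then be padded to a smaller maximum length, which is where the time savings come from).

```python
from collections import defaultdict
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')

# Token-length boundaries taken from the list above: [lower, upper) in tokens.
BUCKETS = [(0, 32), (32, 64), (64, 128), (128, 512)]

def bucket_sentences(sentences):
    """Group sentences into the four length-based sections described above."""
    buckets = defaultdict(list)
    for sentence in sentences:
        n_tokens = len(tokenizer.tokenize(sentence))
        for lower, upper in BUCKETS:
            if lower <= n_tokens < upper:
                buckets[(lower, upper)].append(sentence)
                break
    return buckets

# Hypothetical usage: buckets = bucket_sentences(corpus_sentences)
```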
## Technical Details
The model was trained on a DGX machine with 8 V100 GPUs using the standard Hugging Face training procedure. Splitting the data by number of tokens reduces training time, since each section can be padded to a shorter maximum length. Training each section first with a relatively high learning rate and then with a lower one helps the model converge more effectively; a sketch of this two-phase schedule follows.
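A minimal sketch of that two-phase schedule, using the Hugging Face `Trainer` with a masked language modeling collator. This is not the original training script: it assumes a pre-tokenized `train_dataset` already exists for one length section, and the output directories and batch size are placeholders.

```python
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
# Fresh BERT-base-style model for pretraining from scratch (sketch only).
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Two phases of 5 epochs each, mirroring the schedule described above.
for phase_lr in (1e-4, 1e-5):
    args = TrainingArguments(
        output_dir=f'alephbert-mlm-lr-{phase_lr}',   # placeholder output path
        learning_rate=phase_lr,
        num_train_epochs=5,
        per_device_train_batch_size=32,              # placeholder batch size
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,  # assumed: a pre-tokenized dataset of one length section
        data_collator=collator,
    )
    trainer.train()
```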
## License
The model is licensed under the Apache-2.0 license.