# AlephBERT
State-of-the-art Hebrew language model based on Google's BERT architecture.
## Quick Start
AlephBERT is a state-of-the-art language model designed for Hebrew. It is built upon Google's BERT architecture (Devlin et al. 2018).
```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

alephbert.eval()
```
## Features
- Advanced Architecture: Based on Google's BERT architecture, providing high-performance language understanding for Hebrew.
- Diverse Training Data: Trained on a wide range of Hebrew datasets, including OSCAR, Wikipedia, and Twitter data.
## Installation

No separate installation is required: AlephBERT is loaded directly from the Hugging Face Hub via the `transformers` library (for example, `pip install transformers`, together with PyTorch).
## Usage Examples

### Basic Usage
```python
from transformers import BertModel, BertTokenizerFast

# Load the pretrained tokenizer and encoder from the Hugging Face Hub.
alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# If not fine-tuning, switch to evaluation mode (disables dropout).
alephbert.eval()
```
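Building on the block above, here is a minimal sketch of running a forward pass; the example sentence and variable names are illustrative and not part of the original model card:

```python
import torch

# Illustrative Hebrew input ("hello world"); any short Hebrew text works here.
sentence = 'שלום עולם'

# Tokenize and run a forward pass without tracking gradients.
inputs = alephbert_tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    outputs = alephbert(**inputs)

# One contextual vector per token: shape (batch_size, sequence_length, hidden_size),
# i.e. (1, number_of_tokens, 768) for the base model.
print(outputs.last_hidden_state.shape)
```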
## Documentation

### Training data
- OSCAR (Ortiz, 2019) Hebrew section (10 GB text, 20 million sentences).
- Hebrew dump of Wikipedia (650 MB text, 3 million sentences).
- Hebrew Tweets collected from the Twitter sample stream (7 GB text, 70 million sentences).
### Training procedure
The model was trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.
Since the larger part of our training data consists of tweets, we decided to start by optimizing with the Masked Language Model (MLM) loss only.
To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence:
- num tokens < 32 (70M sentences)
- 32 <= num tokens < 64 (12M sentences)
- 64 <= num tokens < 128 (10M sentences)
- 128 <= num tokens < 512 (1.5M sentences)
Each section was first trained for 5 epochs with an initial learning rate of 1e-4, and then for another 5 epochs with an initial learning rate of 1e-5, for a total of 10 epochs.
Total training time was 8 days.
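The length-based split described above can be illustrated with a short, hypothetical helper (this is a sketch for clarity, not the original preprocessing code; the function name and the handling of over-long sentences are assumptions):

```python
# Hypothetical helper illustrating the length-based bucketing described above.
# Bucket boundaries follow the section limits: <32, <64, <128, <512 tokens.
def bucket_by_token_count(sentences, tokenizer, boundaries=(32, 64, 128, 512)):
    buckets = {limit: [] for limit in boundaries}
    for sentence in sentences:
        num_tokens = len(tokenizer.tokenize(sentence))
        for limit in boundaries:
            if num_tokens < limit:
                buckets[limit].append(sentence)
                break
        # Sentences with 512 or more tokens are simply skipped in this sketch.
    return buckets
```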
## Technical Details

The model is based on Google's BERT architecture and was trained on a DGX machine with 8 V100 GPUs using the standard Hugging Face training procedure. To optimize training time, the data was split into sections by sentence length (number of tokens).
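For orientation only, below is a minimal sketch of a standard Hugging Face masked-language-modeling setup; the tiny corpus, output path, batch size, and other values are placeholders rather than the original training configuration (which used 8 V100 GPUs, length-based sections, 5 + 5 epochs, and learning rates of 1e-4 then 1e-5):

```python
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
model = BertForMaskedLM.from_pretrained('onlplab/alephbert-base')

# Placeholder corpus; in practice this would be the tokenized OSCAR/Wikipedia/Twitter data.
texts = ['שלום עולם', 'זה משפט לדוגמה']  # "hello world", "this is an example sentence"
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# Randomly mask 15% of the input tokens for the MLM objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='./alephbert-mlm',        # hypothetical output path
    num_train_epochs=5,
    learning_rate=1e-4,
    per_device_train_batch_size=32,      # placeholder value
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```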
## License

This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Model Type | Hebrew Language Model |
| Training Data | OSCAR Hebrew section, Hebrew Wikipedia dump, Hebrew Tweets |