# distilgpt2-base-pretrained-he
A tiny GPT2-based Hebrew text generation model. It was initially trained on a TPUv3-8, made available via the TPU Research Cloud Program, and then further fine-tuned on GPU.
## Quick Start
This is a Hebrew text generation model based on GPT2. It has been trained on multiple datasets and can be used to generate Hebrew text.
⨠Features
- Trained on multiple Hebrew datasets, including OSCAR, CC-100, Hebrew Twitter, Wikipedia, and other sources.
- Initially trained on a TPUv3-8 and further fine-tuned on GPU.
## Installation
Installation steps are not provided in the original README. Refer to the official documentation of the `transformers` library to install the necessary dependencies.
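A minimal setup (not part of the original card; it assumes a recent Python environment with pip and the PyTorch backend) would be:

```bash
pip install transformers torch
```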
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

def main():
    model_name = "Norod78/distilgpt2-base-pretrained-he"
    prompt_text = "שלום, קוראים לי"  # "Hello, my name is"
    generated_max_length = 192

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name)
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Build a text-generation pipeline and sample a continuation of the prompt
    text_generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
    print("Generating text...")
    result = text_generator(prompt_text, num_return_sequences=1, batch_size=1, do_sample=True,
                            top_k=40, top_p=0.92, temperature=1.0, repetition_penalty=5.0,
                            max_length=generated_max_length)
    print("result = " + str(result))

if __name__ == "__main__":
    main()
```
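The pipeline returns a list with one dictionary per generated sequence, and the text itself is stored under the `generated_text` key; inside `main()` you could extract it like this:

```python
# Extract the plain generated string from the pipeline output
generated_text = result[0]["generated_text"]
print(generated_text)
```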
## Documentation
### Dataset
OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
The corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
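As a small illustration of that layout (a sketch that is not part of the original card; the file name is a placeholder), the following splits an OSCAR-style text file into documents and paragraphs:

```python
# Parse an OSCAR-style plain-text file: documents are separated by blank lines
# (double newlines) and paragraphs within a document by single newlines.
# "oscar_he.txt" is a hypothetical local file name.
with open("oscar_he.txt", encoding="utf-8") as f:
    raw = f.read()

documents = [doc for doc in raw.split("\n\n") if doc.strip()]
for doc in documents[:3]:
    paragraphs = doc.split("\n")
    print(f"Document with {len(paragraphs)} paragraph(s)")
```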
### Misc
- Hebrew Twitter
- Wikipedia
- Various other sources
### Training
- Done on a TPUv3-8 VM using [Hugging Face's clm-flax example script](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_clm_flax.py); a hypothetical launch command is sketched after this list
- A list of tips to make it easier for others to use this script was posted to this discussion forum
- Further training was performed on GPU
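The exact training command is not given in the card. A hypothetical invocation of the example script is sketched below; the paths and hyperparameter values are illustrative only, and the flag names should be checked against the script's `--help` output:

```bash
# Illustrative only: training a GPT2-style causal LM on the Hebrew OSCAR split
python run_clm_flax.py \
    --output_dir ./distilgpt2-base-pretrained-he \
    --model_type gpt2 \
    --config_name ./distilgpt2-base-pretrained-he \
    --tokenizer_name ./distilgpt2-base-pretrained-he \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_he \
    --do_train \
    --block_size 512 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-4 \
    --num_train_epochs 1
```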
## License
This project is licensed under the MIT license.