# distilgpt2-base-pretrained-he
A tiny GPT2-based Hebrew text generation model. It was initially trained on a TPUv3-8, made available via the TPU Research Cloud Program, and then further fine-tuned on GPU.
## Quick Start
This is a Hebrew text generation model based on GPT2. It has been trained on multiple datasets and can be used to generate Hebrew text.
⨠Features
- Trained on multiple Hebrew datasets, including OSCAR, CC-100, Hebrew Twitter, Wikipedia, and other sources.
- Initially trained on a TPUv3-8 and further fine-tuned on GPU.
## Installation
Installation steps are not provided in the original README. Refer to the official documentation of the `transformers` library to install the necessary dependencies.
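A minimal setup (not part of the original card; it assumes a recent Python environment with pip and the PyTorch backend) would be:

```bash
pip install transformers torch
```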
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

def main():
    model_name = "Norod78/distilgpt2-base-pretrained-he"
    prompt_text = "שלום, קוראים לי"  # "Hello, my name is"
    generated_max_length = 192

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name)
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Build a text-generation pipeline and sample a continuation of the prompt
    text_generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
    print("Generating text...")
    result = text_generator(prompt_text, num_return_sequences=1, batch_size=1, do_sample=True,
                            top_k=40, top_p=0.92, temperature=1.0, repetition_penalty=5.0,
                            max_length=generated_max_length)
    print("result = " + str(result))

if __name__ == "__main__":
    main()
```
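The pipeline returns a list with one dictionary per generated sequence, and the text itself is stored under the `generated_text` key; inside `main()` you could extract it like this:

```python
# Extract the plain generated string from the pipeline output
generated_text = result[0]["generated_text"]
print(generated_text)
```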
## Documentation
### Dataset
OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
The corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
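As a small illustration of that layout (a sketch that is not part of the original card; the file name is a placeholder), the following splits an OSCAR-style text file into documents and paragraphs:

```python
# Parse an OSCAR-style plain-text file: documents are separated by blank lines
# (double newlines) and paragraphs within a document by single newlines.
# "oscar_he.txt" is a hypothetical local file name.
with open("oscar_he.txt", encoding="utf-8") as f:
    raw = f.read()

documents = [doc for doc in raw.split("\n\n") if doc.strip()]
for doc in documents[:3]:
    paragraphs = doc.split("\n")
    print(f"Document with {len(paragraphs)} paragraph(s)")
```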
### Misc
- Hebrew Twitter
- Wikipedia
- Various other sources
### Training
- Done on a TPUv3-8 VM using [Hugging Face's clm-flax example script](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_clm_flax.py); a hypothetical launch command is sketched after this list
- A list of tips to make it easier for others to use this script was posted to this discussion forum
- Further training was performed on GPU
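The exact training command is not given in the card. A hypothetical invocation of the example script is sketched below; the paths and hyperparameter values are illustrative only, and the flag names should be checked against the script's `--help` output:

```bash
# Illustrative only: training a GPT2-style causal LM on the Hebrew OSCAR split
python run_clm_flax.py \
    --output_dir ./distilgpt2-base-pretrained-he \
    --model_type gpt2 \
    --config_name ./distilgpt2-base-pretrained-he \
    --tokenizer_name ./distilgpt2-base-pretrained-he \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_he \
    --do_train \
    --block_size 512 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-4 \
    --num_train_epochs 1
```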
## License
This project is licensed under the MIT license.