🚀 GPT2-Spanish
GPT2-Spanish is a language generation model trained from scratch on 11.5GB of Spanish texts, with the same parameters as the medium version of the original OpenAI GPT2 model. It uses a Byte Pair Encoding (BPE) tokenizer trained specifically for this task.
✨ Features
- Trained from scratch with 11.5GB of Spanish texts.
- Uses a Byte Pair Encoding (BPE) tokenizer trained for Spanish.
- Parameters match the medium version of the original OpenAI GPT2 model.
📦 Installation
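The card does not list explicit installation steps. Since the model and tokenizer were trained with the Hugging Face libraries, the model can presumably be loaded through the `transformers` package, installed with `pip install transformers` (together with a PyTorch or TensorFlow backend).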
💻 Usage Examples
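A minimal text-generation sketch with the `transformers` API is shown below. The Hub identifier `DeepESP/gpt2-spanish` is an assumption (the card does not state the published repository name); adjust it to wherever the model is actually hosted.

```python
# Generation sketch using the transformers library. The Hub identifier below
# is an assumption; replace it with the repository where the model is hosted.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DeepESP/gpt2-spanish"  # assumed Hugging Face Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "La inteligencia artificial"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation of up to 50 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```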
📚 Documentation
Corpus
This model was trained with an 11.5GB text corpus, which includes 3.5GB of Wikipedia articles and 8GB of books (covering narrative, short stories, theater, poetry, essays, and popularization).
Tokenizer
The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for Unicode characters), with a vocabulary size of 50257. The inputs are sequences of 1024 consecutive tokens.
This tokenizer was trained from scratch on the Spanish corpus because the tokenizers of English models were found to have limitations in capturing the semantic relationships of Spanish, owing to the morphosyntactic differences between the two languages.
In addition to the special token "<|endoftext|>" that marks the end of a text in the OpenAI GPT2 models, the tokens "<|talk|>", "<|ax1|>", "<|ax2|>", ..., "<|ax9|>" were included to serve as prompts in future training.
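As an illustration of the setup just described, the following sketch trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library using the card's vocabulary size and special tokens. The corpus file name and output directory are placeholders, not the authors' actual files.

```python
# Illustrative sketch: train a byte-level BPE tokenizer with the card's
# vocabulary size and special tokens. Paths are placeholders.
import os

from tokenizers import ByteLevelBPETokenizer

special_tokens = ["<|endoftext|>", "<|talk|>"] + [f"<|ax{i}|>" for i in range(1, 10)]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_es.txt"],       # placeholder path to the Spanish corpus
    vocab_size=50257,              # vocabulary size reported in the card
    special_tokens=special_tokens,
)

os.makedirs("spanish-bpe-tokenizer", exist_ok=True)
tokenizer.save_model("spanish-bpe-tokenizer")

# Quick check: encode a Spanish sentence with the freshly trained tokenizer.
print(tokenizer.encode("La inteligencia artificial transformará el mundo.").tokens)
```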
Training
The model and tokenizer were trained using the Hugging Face libraries on Google Colab servers with an Nvidia Tesla V100 GPU (16GB of memory).
🔧 Technical Details
The model and tokenizer were trained from scratch. The training data consists of 11.5GB of Spanish texts, including Wikipedia articles and various types of books. The Byte Pair Encoding (BPE) tokenizer was specifically trained for Spanish to address the morphosyntactic differences between Spanish and English. The training was carried out using Hugging Face libraries on Google Colab servers with an Nvidia Tesla V100 GPU.
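The sketch below shows roughly how such a from-scratch training run could be set up with the Hugging Face `Trainer`, using a GPT2-medium-sized configuration, 1024-token inputs, and the 50257-token vocabulary described above. File paths, batch size, and other hyperparameters are placeholders, not the authors' actual configuration.

```python
# Illustrative from-scratch training sketch with the Hugging Face Trainer.
# Paths and hyperparameters are placeholders, not the authors' actual setup.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the Spanish BPE tokenizer (placeholder path to vocab.json/merges.txt).
tokenizer = GPT2TokenizerFast.from_pretrained("spanish-bpe-tokenizer")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default

# GPT2-medium dimensions with the card's vocabulary and context size.
config = GPT2Config(vocab_size=50257, n_positions=1024, n_embd=1024, n_layer=24, n_head=16)
model = GPT2LMHeadModel(config)

# Plain-text corpus loaded line by line; "corpus_es.txt" is a placeholder file.
dataset = load_dataset("text", data_files={"train": "corpus_es.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-spanish",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=tokenized["train"],
)
trainer.train()
```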
📄 License
This project is licensed under the MIT license.
👥 Authors
The model was trained by Alejandro Oñate Latorre (Spain) and Jorge Ortiz Fuentes (Chile), members of Deep ESP, an open-source community on Natural Language Processing in Spanish (https://t.me/joinchat/VoEp1bPrDYEexc6h).
Thanks to the community members who contributed funding for the initial tests.
⚠️ Important Note
The model generates texts according to the patterns learned in the training corpus. Since these data were not filtered, the model may generate offensive or discriminatory content.
| Property | Details |
|----------|---------|
| Model Type | Language generation model |
| Training Data | 11.5GB of Spanish texts (3.5GB of Wikipedia articles and 8GB of books) |
| Tokenizer | Byte-level Byte Pair Encoding (BPE) with a vocabulary size of 50257 |