🚀 GPT2-Spanish
GPT2-Spanish is a language generation model trained from scratch on 11.5GB of Spanish texts, with the same parameters as the medium version of the original OpenAI GPT2 model. It uses a Byte Pair Encoding (BPE) tokenizer trained specifically for this task.
✨ Features
- Trained from scratch with 11.5GB of Spanish texts.
- Uses a Byte Pair Encoding (BPE) tokenizer trained for Spanish.
- Parameters match the medium version of the original OpenAI GPT2 model.
📦 Installation
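The card does not list explicit installation steps. Since the model and tokenizer were trained with the Hugging Face libraries, the model can presumably be loaded through the `transformers` package, installed with `pip install transformers` (together with a PyTorch or TensorFlow backend).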
💻 Usage Examples
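A minimal text-generation sketch with the `transformers` API is shown below. The Hub identifier `DeepESP/gpt2-spanish` is an assumption (the card does not state the published repository name); adjust it to wherever the model is actually hosted.

```python
# Generation sketch using the transformers library. The Hub identifier below
# is an assumption; replace it with the repository where the model is hosted.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DeepESP/gpt2-spanish"  # assumed Hugging Face Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "La inteligencia artificial"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation of up to 50 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```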
📚 Documentation
Corpus
This model was trained with an 11.5GB text corpus, which includes 3.5GB of Wikipedia articles and 8GB of books (covering narrative, short stories, theater, poetry, essays, and popularization).
Tokenizer
The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for Unicode characters), with a vocabulary size of 50257. The inputs are sequences of 1024 consecutive tokens.
This tokenizer was trained from scratch on the Spanish corpus because the tokenizers of English models were found to have limitations in capturing the semantic relationships of Spanish, owing to the morphosyntactic differences between the two languages.
In addition to the special token "<|endoftext|>" that marks the end of a text in the OpenAI GPT2 models, the tokens "<|talk|>", "<|ax1|>", "<|ax2|>", ..., "<|ax9|>" were included to serve as prompts in future training.
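As an illustration of the setup just described, the following sketch trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library using the card's vocabulary size and special tokens. The corpus file name and output directory are placeholders, not the authors' actual files.

```python
# Illustrative sketch: train a byte-level BPE tokenizer with the card's
# vocabulary size and special tokens. Paths are placeholders.
import os

from tokenizers import ByteLevelBPETokenizer

special_tokens = ["<|endoftext|>", "<|talk|>"] + [f"<|ax{i}|>" for i in range(1, 10)]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_es.txt"],       # placeholder path to the Spanish corpus
    vocab_size=50257,              # vocabulary size reported in the card
    special_tokens=special_tokens,
)

os.makedirs("spanish-bpe-tokenizer", exist_ok=True)
tokenizer.save_model("spanish-bpe-tokenizer")

# Quick check: encode a Spanish sentence with the freshly trained tokenizer.
print(tokenizer.encode("La inteligencia artificial transformará el mundo.").tokens)
```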
Training
The model and tokenizer were trained using the Hugging Face libraries on Google Colab servers with an Nvidia Tesla V100 GPU (16GB of memory).
🔧 Technical Details
The model and tokenizer were trained from scratch. The training data consists of 11.5GB of Spanish texts, including Wikipedia articles and various types of books. The Byte Pair Encoding (BPE) tokenizer was specifically trained for Spanish to address the morphosyntactic differences between Spanish and English. The training was carried out using Hugging Face libraries on Google Colab servers with an Nvidia Tesla V100 GPU.
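The sketch below shows roughly how such a from-scratch training run could be set up with the Hugging Face `Trainer`, using a GPT2-medium-sized configuration, 1024-token inputs, and the 50257-token vocabulary described above. File paths, batch size, and other hyperparameters are placeholders, not the authors' actual configuration.

```python
# Illustrative from-scratch training sketch with the Hugging Face Trainer.
# Paths and hyperparameters are placeholders, not the authors' actual setup.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the Spanish BPE tokenizer (placeholder path to vocab.json/merges.txt).
tokenizer = GPT2TokenizerFast.from_pretrained("spanish-bpe-tokenizer")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 tokenizers have no pad token by default

# GPT2-medium dimensions with the card's vocabulary and context size.
config = GPT2Config(vocab_size=50257, n_positions=1024, n_embd=1024, n_layer=24, n_head=16)
model = GPT2LMHeadModel(config)

# Plain-text corpus loaded line by line; "corpus_es.txt" is a placeholder file.
dataset = load_dataset("text", data_files={"train": "corpus_es.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-spanish",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=tokenized["train"],
)
trainer.train()
```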
📄 License
This project is licensed under the MIT license.
👥 Authors
The model was trained by Alejandro Oñate Latorre (Spain) and Jorge Ortiz Fuentes (Chile), members of Deep ESP, an open-source community on Natural Language Processing in Spanish (https://t.me/joinchat/VoEp1bPrDYEexc6h).
Thanks to the community members who contributed funding for the initial tests.
⚠️ Important Note
The model generates texts according to the patterns learned in the training corpus. Since these data were not filtered, the model may generate offensive or discriminatory content.
| Property | Details |
|----------|---------|
| Model Type | Language generation model |
| Training Data | 11.5GB of Spanish texts (3.5GB of Wikipedia articles and 8GB of books) |
| Tokenizer | Byte-level Byte Pair Encoding (BPE) with a vocabulary size of 50257 |