🚀 GPT2-small-spanish: a Language Model for Spanish text generation (and more NLP tasks...)
GPT2-small-spanish is a state-of-the-art language model for Spanish based on the GPT-2 small model. It can be used for Spanish text generation and other NLP tasks, offering high-quality language processing capabilities.
✨ Features
- Advanced Architecture: Based on the GPT-2 small model, it inherits powerful language processing capabilities.
- Transfer Learning and Fine-tuning: Trained on Spanish Wikipedia using transfer learning and fine-tuning techniques, enabling efficient adaptation to the Spanish language.
- Fastai v2 Integration: Fine-tuned using Hugging Face libraries wrapped in the fastai v2 deep learning framework, leveraging advanced fine-tuning techniques.
📦 Installation
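The original card does not list explicit installation steps. As with other Hugging Face models, installing the 🤗 Transformers library and a backend (for example `pip install transformers torch`) should be enough to load the model; see the usage sketch under Availability below.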
📚 Documentation
Model Training
It was trained on Spanish Wikipedia. The training took around 70 hours with four NVIDIA GTX 1080 Ti GPUs (11GB of DDR5 each) and around 3GB of (processed) training data. It was fine-tuned from the English pre-trained GPT-2 small model using the Hugging Face libraries (Transformers and Tokenizers) wrapped into the fastai v2 deep learning framework. All the fastai v2 fine-tuning techniques were used.
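The card describes this recipe only at a high level. The snippet below is a minimal sketch of that transfer-learning setup using the Transformers `Trainer` API instead of the fastai v2 wrapper the authors actually used; the file path, batch size, and other hyperparameters are illustrative assumptions, not values from the original run.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

# Start from the English pre-trained GPT-2 small checkpoint, as described above.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# "es_wiki.txt" is a hypothetical path to the processed Spanish Wikipedia text.
raw = load_dataset("text", data_files={"train": "es_wiki.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling: the collator builds labels from the input ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-small-spanish",
    num_train_epochs=1,              # illustrative; the real run took ~70 hours
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=train_set,
        data_collator=collator).train()
```

The authors' actual pipeline wraps these same Transformers objects in fastai v2 Learners, which is what provides the fastai fine-tuning techniques mentioned above (such as learning-rate finding, one-cycle scheduling, and gradual unfreezing).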
The training is based entirely on the GPorTuguese-2 model developed by Pierre Guillou. The training details are described in his article: "Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)".
Availability
This preliminary version is now available on Hugging Face.
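As a quick illustration, the model can be loaded with the Transformers `pipeline` API. The repository id `datificate/gpt2-small-spanish` is an assumption here; verify the exact name on the Hugging Face model page.

```python
from transformers import pipeline

# Repository id assumed to be "datificate/gpt2-small-spanish"; verify it on the
# Hugging Face model page before use.
generator = pipeline("text-generation", model="datificate/gpt2-small-spanish")

result = generator("La inteligencia artificial es",
                   max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])
```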
Limitations and bias
(Copied from the original GPorTuguese-2 model) The training data used for this model come from Spanish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:
> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don't support use-cases that require the generated text to be true. Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.
Authors
The model was trained and evaluated by Josué Obregon and Berny Carrera, founders of Datificate, a space for learning Machine Learning in Spanish.
The training was possible thanks to the computing power of several NVIDIA GTX 1080 Ti GPUs of the IAI Lab (Kyung Hee University), where Josué works as a Postdoctoral Researcher in Industrial Artificial Intelligence.
As stated before, this work is mainly based on the work of Pierre Guillou.
📄 License
The model is licensed under the Apache-2.0 license.
| Property | Details |
|---|---|
| Model Type | State-of-the-art language model for Spanish based on the GPT-2 small model |
| Training Data | Spanish Wikipedia |
| License | Apache-2.0 |