🚀 t5-small-spanish-nahuatl
This project addresses the challenge of neural machine translation between Spanish and Nahuatl. It leverages the T5 text-to-text prefix training strategy to compensate for the lack of structured data, enabling the translation of short sentences.
🚀 Quick Start
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is challenging due to the lack of structured data. Popular datasets such as Axolotl and the bible-corpus only contain about 16,000 and 7,000 samples, respectively. Moreover, there are multiple variants of Nahuatl, which makes the task even more difficult; for example, a single word from the Axolotl dataset can be written in more than three different ways. We therefore use the T5 text-to-text prefix training strategy to compensate for the data shortage. We first teach the multilingual model Spanish and then adapt it to Nahuatl. The resulting T5 Transformer successfully translates short sentences. Finally, we report Chrf and BLEU results.
✨ Features
- Data-shortage compensation: Utilizes the T5 text-to-text prefix training strategy to deal with the lack of structured data for Nahuatl.
- Multilingual adaptation: Trains the model on Spanish first and then adapts it to Nahuatl.
- Translation ability: Can translate short Spanish sentences to Nahuatl.
📦 Installation
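The usage example below assumes the Hugging Face `transformers` library with a PyTorch backend and the SentencePiece dependency required by the T5 tokenizer, installed for example with `pip install transformers sentencepiece torch`.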
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
model.eval()

# The task prefix tells the model which translation direction to perform
sentence = 'muchas flores son blancas'
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids

# Generate and decode the Nahuatl translation
outputs = model.generate(input_ids)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(outputs)
```
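Note that `model.generate` defaults to greedy decoding with a short maximum output length; standard `transformers` generation arguments such as `num_beams` or `max_length` can be passed for longer or higher-quality outputs (these settings are not specified by the original project).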
📚 Documentation
Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using the 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
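As an illustration of this preprocessing step, the sketch below normalizes a sentence to the 'sep' orthography. The `Normalizer` class and `normalize` method are assumed to be exposed by py-elotl's `elotl.nahuatl.orthography` module and the input string is only a placeholder; the exact API may differ between py-elotl versions.

```python
# Assumed py-elotl API; the class and method names may differ between versions.
from elotl.nahuatl.orthography import Normalizer

normalizer = Normalizer("sep")  # 'sep' is the normalization used for this model's data
print(normalizer.normalize("onkah miak xochitl"))  # placeholder input sentence
```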
Approach
Dataset
Since the Axolotl corpus contains misalignments, we select the best samples (12,207). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821).
| Axolotl best aligned books |
|---|
| Anales de Tlatelolco |
| Diario |
| Documentos nauas de la Ciudad de México del siglo XVI |
| Historia de México narrada en náhuatl y español |
| La tinta negra y roja (antología de poesía náhuatl) |
| Memorial Breve (Libro las ocho relaciones) |
| Método auto-didáctico náhuatl-español |
| Nican Mopohua |
| Quinta Relación (Libro las ocho relaciones) |
| Recetario Nahua de Milpa Alta D.F. |
| Testimonios de la antigua palabra |
| Trece Poetas del Mundo Azteca |
| Una tortillita nomás - Se taxkaltsin saj |
| Vida económica de Tenochtitlan |
We also collected 3,000 extra samples from the web to increase the amount of training data.
Model and training
We employ two training stages using a multilingual T5-small. The advantage of this model is that it can handle different vocabularies and prefixes. T5-small is pre-trained on different tasks and languages (French, Romanian, English, German).
Training stage 1 (learning Spanish)
In training stage 1, we first introduce Spanish to the model. The goal is to learn a new data-rich language (Spanish) without losing the previously acquired knowledge. We use the English-Spanish Anki dataset, which consists of 118,964 text pairs. The model is trained to convergence, adding the prefix "Translate Spanish to English: ".
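For concreteness, the sketch below shows how a prefixed text pair can be turned into T5 training features with the `transformers` tokenizer. The `make_features` helper, the `es`/`en` keys, and the `max_length` value are illustrative assumptions, not the project's actual preprocessing code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')

def make_features(pair, prefix='Translate Spanish to English: '):
    # Illustrative helper (not the project's code): the task prefix is prepended to the
    # source text and the target text becomes the label sequence.
    inputs = tokenizer(prefix + pair['es'], truncation=True, max_length=128)
    labels = tokenizer(text_target=pair['en'], truncation=True, max_length=128)
    inputs['labels'] = labels['input_ids']
    return inputs

example = {'es': 'muchas flores son blancas', 'en': 'many flowers are white'}
features = make_features(example)
```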
Training stage 2 (learning Nahuatl)
We use the pretrained Spanish-English model to learn Spanish-Nahuatl. Since the number of available Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.
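A minimal sketch of the two-task mixture used in stage 2; the variable names, placeholder pairs, and simple concatenation are assumptions about the setup rather than the project's code.

```python
import random

# Prefixed (source, target) pairs for each task; the contents are placeholders.
nahuatl_pairs = [('translate Spanish to Nahuatl: muchas flores son blancas', '...')]
anki_pairs = [('Translate Spanish to English: muchas flores son blancas', 'many flowers are white')]

# Stage 2 mixes the limited Spanish-Nahuatl data with 20,000 English-Spanish samples,
# so the model keeps training on both tasks and overfits less to the small corpus.
train_pairs = nahuatl_pairs + anki_pairs[:20000]
random.shuffle(train_pairs)
```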
Training setup
We train the models on the same datasets for 660k steps with a batch size of 16 and a learning rate of 2e-5.
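Expressed with the Hugging Face `Trainer` API, the reported setup corresponds roughly to the arguments below; the output directory and any option not stated above (evaluation, saving, scheduler) are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters reported above; 'output_dir' and everything omitted are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir='t5-small-spanish-nahuatl',
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    max_steps=660_000,
)
```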
Evaluation results
We evaluate the models on the same 505 validation Nahuatl sentences for a fair comparison. Finally, we report the results using the chrf and sacrebleu Hugging Face metrics:
| English-Spanish pretraining | Validation loss | BLEU | Chrf |
|---|---|---|---|
| False | 1.34 | 6.17 | 26.96 |
| True | 1.31 | 6.18 | 28.21 |
The English-Spanish pretraining improves BLEU and Chrf and leads to faster convergence. The evaluation is available in the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
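The metrics can be reproduced with the Hugging Face `evaluate` library, roughly as sketched below; the prediction and reference strings here are placeholders rather than the actual 505 validation sentences.

```python
import evaluate

chrf = evaluate.load('chrf')
sacrebleu = evaluate.load('sacrebleu')

# Placeholder data; the reported scores use the 505 validation Nahuatl sentences.
predictions = ['miak istak xochitl']
references = [['miak istak xochitl']]

print(chrf.compute(predictions=predictions, references=references))
print(sacrebleu.compute(predictions=predictions, references=references))
```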
🔧 Technical Details
The model uses the T5 text-to-text prefix training strategy. It is first trained on Spanish data and then adapted to Nahuatl. The data is carefully selected from different sources and normalized, and the two-stage training approach helps to cope with the data shortage and to avoid overfitting.
📄 License
This model is released under the apache-2.0 license.
References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified Text-to-Text transformer.
- Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
- https://github.com/christos-c/bible-corpus
- https://github.com/ElotlMX/py-elotl
Team members
| Property | Details |
|---|---|
| Language | Spanish, Nahuatl, Multilingual |
| Model Type | T5 Transformer (t5-small) |
| Training Data | Axolotl corpus (best 12,207 samples), bible-corpus (7,821 samples), 3,000 web-collected samples, English-Spanish Anki dataset (118,964 pairs), additional 20,000 English-Spanish Anki samples for Nahuatl training |
| License | apache-2.0 |
| Tags | translation |