T5S Spanish Question Generation (QG): an open-source model that generates questions from Spanish text for Spanish-learning applications

t5s-spanish-qg

Developed by benjleite

A T5-based Spanish question generation model, fine-tuned on a machine-translated Spanish version of the FairytaleQA dataset.

Question Answering System

Transformers

Spanish

Open Source License: Apache-2.0

#Spanish Question Generation #Educational Narrative Comprehension #T5 Fine-tuned Model

Downloads 50

Release Time : 6/18/2024

Model Overview

This model generates educational questions from Spanish texts. It targets comprehension of narrative elements and is intended for K-8 educational settings.

Model Features

Education Domain Optimization

Fine-tuned specifically for question generation from children's story content, suitable for educational applications

Narrative Element Comprehension

Capable of processing seven different types of narrative elements and relationships

Multi-component Input Processing

Uses special tokens to distinguish different input components such as answers and text

Model Capabilities

Spanish Text Comprehension

Educational Question Generation

Narrative Element Analysis

Use Cases

Educational Technology

Reading Comprehension Assistance

Automatically generates comprehension questions for Spanish children's stories

Helps students improve narrative comprehension skills

Educational Content Development

Automatically generates teaching materials and assessment questions for teachers

Saves teachers' preparation time and provides diverse questions

🚀 t5s-spanish-qg Model Card

t5s-spanish-qg is a T5-based text generation model for question generation in Spanish. It is fine-tuned from T5S on a machine-translated Spanish version of the original English FairytaleQA dataset. More details are available in our paper, which has been accepted at EC-TEL 2024.

✨ Features

Fine-tuned for Spanish: Trained on a Spanish machine-translated dataset, making it suitable for Spanish question generation tasks.
Question Generation Task: Specifically designed to generate questions from given text and answers.
Based on T5 Architecture: Uses the T5 encoder-decoder architecture for sequence-to-sequence tasks.

📦 Installation

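To use the t5s-spanish-qg model, install the transformers library if you haven't already. The usage examples below also rely on PyTorch (tensors are returned with return_tensors='pt') and on sentencepiece (required by T5Tokenizer). You can install all three with pip:

pip install transformers sentencepiece torch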

💻 Usage Examples

Basic Usage

>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qg")
>>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)

Important Note: The special tokens must be added to the tokenizer, and the model's token embeddings must then be resized to match the new vocabulary size:

>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))

Advanced Usage

# Build the encoder input: the answer and the passage, each introduced by its special token.
input_text = '<respuesta>' + 'Un Oso.' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'

source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation=True,  # a single sequence is passed, so 'only_second' does not apply
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)

input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

# Generate with beam search (beam width of 5).
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode each returned sequence. A list keeps the order of the sequences;
# special tokens are kept in the decoded string (skip_special_tokens=False).
predictions = [
    tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]

generated_str = ''.join(predictions)

print(generated_str)

Note: See our repository for additional code details.
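The input layout used in the example (the answer, then the text, each introduced by its special token) can be wrapped in a small helper. This is only a sketch: the name build_qg_input and the two constants are hypothetical and do not come from the model's repository. The function simply reproduces the string concatenation shown in the usage example.

```python
# Hypothetical helper (not part of the original repository): builds the
# encoder input string in the same layout as the usage example.
ANSWER_TOKEN = "<respuesta>"
TEXT_TOKEN = "<texto>"


def build_qg_input(answer: str, text: str) -> str:
    """Concatenate answer and passage, each prefixed by its special token."""
    return f"{ANSWER_TOKEN}{answer.strip()}{TEXT_TOKEN}{text.strip()}"


example = build_qg_input(
    "Un Oso.",
    "Érase una vez un oso al que le gustaba pasear por el bosque...",
)
print(example)
```

The resulting string can be passed to the tokenizer in place of input_text.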

📚 Documentation

Model Description

The t5s-spanish-qg model is fine-tuned from T5S on a Spanish machine-translated version of the FairytaleQA dataset. The fine-tuning task is Question Generation.

Training Data

FairytaleQA is an open-source dataset aimed at improving narrative comprehension for students from kindergarten to eighth grade. It contains 10,580 explicit and implicit questions from 278 child-friendly stories, covering seven types of narrative elements or relations. The dataset is carefully annotated by education experts based on an evidence-based theoretical framework.

Implementation Details

The encoder receives the concatenated answer and text, and the decoder generates the question. Special tokens distinguish the components. The maximum input length is 512 tokens and the maximum output length is 128 tokens. The model was trained for a maximum of 20 epochs with early stopping (patience of 2) and a batch size of 16. At inference time, beam search with a beam width of 5 is used.
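The stopping rule mentioned above (at most 20 epochs, patience of 2) can be illustrated with a framework-free sketch. The function name epochs_until_stop and the loss values are hypothetical. We also assume that the monitored quantity is a validation loss where lower is better; the source does not state which metric was monitored, so this is not the authors' training code.

```python
from typing import Sequence


def epochs_until_stop(val_losses: Sequence[float],
                      max_epochs: int = 20,
                      patience: int = 2) -> int:
    """Return how many epochs run before training stops.

    Training stops after `max_epochs`, or as soon as the validation loss
    (assumed metric) has failed to improve for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run


# Made-up losses: the best value is reached in epoch 3, then two epochs
# pass without improvement, so training stops after epoch 5.
print(epochs_until_stop([1.0, 0.8, 0.7, 0.75, 0.72, 0.6]))  # prints 5
```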

Evaluation - Question Generation

Model	ROUGE-L F1
T5 (original English dataset, baseline)	0.530
t5s-spanish-qg (Spanish machine-translated dataset)	0.445
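For readers unfamiliar with the metric in the table, the following sketch shows how a ROUGE-L F1 score can be computed from the longest common subsequence (LCS) of a reference question and a generated question. It is a simplified illustration: tokens are obtained by lowercasing and splitting on whitespace, which is an assumption and may differ from the scorer used to produce the reported numbers. The function names and the example sentences are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 with whitespace tokenization (an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Made-up example pair: 5 of the 6 reference tokens are matched in order.
score = rouge_l_f1("¿Qué le gustaba hacer al oso?", "¿Qué le gustaba al oso?")
print(round(score, 3))  # prints 0.909
```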

🔧 Technical Details

Encoder-Decoder Structure: The model uses an encoder-decoder architecture, where the encoder processes the input (answer and text) and the decoder generates the output (question).
Tokenization: Special tokens are used to mark different components of the input. The maximum token input is 512, and the maximum token output is 128.
Training Parameters: The model was trained for a maximum of 20 epochs with early stopping (patience of 2) and a batch size of 16.
Inference: Beam search with a beam width of 5 is used during inference.
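To make the inference setting concrete, here is a toy beam search over a hand-written probability table. It is a sketch of the general technique only, not the implementation inside the transformers library: it has no length penalty, the "model" is a hypothetical lookup table (TOY_MODEL), and all names and probabilities are invented for illustration. The example shows why a wider beam can find a sequence that greedy decoding (beam width 1) misses.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"
Prefix = Tuple[str, ...]

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    (): {"¿Qué": 0.6, "¿Quién": 0.4},
    ("¿Qué",): {"pasó?": 0.55, "hizo?": 0.45},
    ("¿Quién",): {"paseaba?": 0.9, "era?": 0.1},
}


def toy_probs(prefix: Prefix) -> Dict[str, float]:
    """Unknown prefixes end the sequence with probability 1."""
    return TOY_MODEL.get(prefix, {EOS: 1.0})


def beam_search(next_token_probs: Callable[[Prefix], Dict[str, float]],
                beam_width: int = 5,
                max_length: int = 128) -> List[str]:
    """Return the highest-scoring token sequence found by beam search."""
    beams: List[Tuple[float, Prefix]] = [(0.0, ())]  # (log-probability, tokens)
    for _ in range(max_length):
        candidates: List[Tuple[float, Prefix]] = []
        for score, tokens in beams:
            if tokens and tokens[-1] == EOS:
                candidates.append((score, tokens))  # finished: carry over
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((score + math.log(prob), tokens + (token,)))
        # Keep only the `beam_width` best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == EOS for _, tokens in beams):
            break
    return list(beams[0][1])


print(beam_search(toy_probs, beam_width=1))  # greedy: starts with "¿Qué"
print(beam_search(toy_probs, beam_width=5))  # wider beam: starts with "¿Quién"
```

With a beam width of 1 the search commits to the locally best first token (total probability 0.6 × 0.55 = 0.33), whereas a beam width of 5 keeps the alternative alive and ends with the higher-probability sequence (0.4 × 0.9 = 0.36).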

📄 License

This fine-tuned model is released under the Apache-2.0 License.

Citation Information

If you use this model or the related research, please cite the following papers:

Our paper:

@article{leite_fairytaleqa_translated_2024,
        title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages}, 
        author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
        year={2024},
        eprint={2406.04233},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
}

Original FairytaleQA paper:

@inproceedings{xu-etal-2022-fantastic,
    title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
    author = "Xu, Ying  and
      Wang, Dakuo  and
      Yu, Mo  and
      Ritchie, Daniel  and
      Yao, Bingsheng  and
      Wu, Tongshuang  and
      Zhang, Zheng  and
      Li, Toby  and
      Bradford, Nora  and
      Sun, Branda  and
      Hoang, Tran  and
      Sang, Yisi  and
      Hou, Yufang  and
      Ma, Xiaojuan  and
      Yang, Diyi  and
      Peng, Nanyun  and
      Yu, Zhou  and
      Warschauer, Mark",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.34",
    doi = "10.18653/v1/2022.acl-long.34",
    pages = "447--460",
    abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}

T5S model:

@inproceedings{araujo-etal-2024-sequence-sequence,
    title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
    author = "Araujo, Vladimir  and
      Trusca, Maria Mihaela  and
      Tufi{\~n}o, Rodrigo  and
      Moens, Marie-Francine",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1283",
    pages = "14729--14743",
    abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
}
