t5s-spanish-qg: an open-source model that generates questions in Spanish, aimed at Spanish-learning applications

T5s Spanish Qg

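Developed by benjleite

A T5-based question generation model for Spanish, fine-tuned on a Spanish machine-translated version of the FairytaleQA dataset

Question Answering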

Transformers

Language: Spanish | License: Apache-2.0 | #SpanishQuestionGeneration #EducationalNarrativeComprehension #FineTunedT5

Downloads: 50

Released: 6/18/2024

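Model Overview

The model generates education-oriented questions from Spanish text, supports comprehension of narrative elements, and targets K-8 educational settings.

Model Features

Optimized for education

Fine-tuned specifically for question generation over children's stories, which makes it suitable for educational applications.

Narrative element comprehension

Handles seven types of narrative elements and relations.

Multi-component input handling

Uses special tokens to distinguish input components such as the answer and the text.

Model Capabilities

Spanish text comprehension

Educational question generation

Narrative element analysis

Use Cases

Educational technology

Reading comprehension support

Automatically generates comprehension questions for Spanish children's stories.

Helps students improve their narrative comprehension.

Educational content development

Automatically generates teaching materials and assessment questions for teachers.

Saves teachers preparation time and provides varied questions.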

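🚀 t5s-spanish-qg Model Card

t5s-spanish-qg is a T5-based model fine-tuned from T5S on a Spanish machine-translated version of the original English FairytaleQA dataset. The fine-tuning task is question generation. See our paper, which was accepted at ECTEL 2024.

✨ Key Features

Built on the T5 architecture and fine-tuned on a Spanish dataset for question generation.
Uses special tags to distinguish the input components.
Trained for at most 20 epochs with early stopping, which avoids unnecessary epochs.
Uses beam search at inference time to improve the quality of the generated questions.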

💻 Usage Examples

Basic usage

>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qg")
>>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)

Important: add the special tokens and resize the model's token embeddings to match:

>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))

Advanced usage

input_text = '<respuesta>' + 'Un Oso.' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'

source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation=True,  # single input sequence, so 'only_second' does not apply
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
    
input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode each generated sequence; special tokens are kept in the output.
preds = [
    tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]

generated_str = ''.join(preds)

print(generated_str)

Tip: see our repository for more code details.
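
The input string in the advanced example is assembled by hand from the `<respuesta>` and `<texto>` tags. As a minimal sketch, that layout could be wrapped in a small helper so the tag order stays consistent across calls. The name `build_qg_input` is hypothetical and not part of the released code; it only mirrors the concatenation shown above and does no tokenization.

```python
# Hypothetical helper (not from the authors' repository): it reproduces the
# '<respuesta>' + answer + '<texto>' + text layout from the usage example.
ANSWER_TAG = "<respuesta>"
TEXT_TAG = "<texto>"


def build_qg_input(answer: str, text: str) -> str:
    """Concatenate the answer and the story text, each preceded by its tag."""
    if not answer or not text:
        raise ValueError("both the answer and the text must be non-empty")
    return f"{ANSWER_TAG}{answer}{TEXT_TAG}{text}"


input_text = build_qg_input(
    "Un Oso.",
    "Érase una vez un oso al que le gustaba pasear por el bosque...",
)
print(input_text)
```

The resulting string can be passed to the tokenizer call in place of the hand-built `input_text`.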

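📚 Detailed Documentation

Training data

FairytaleQA is an open-source dataset designed to improve narrative comprehension, aimed at students from kindergarten to eighth grade. Education experts annotated it following an evidence-based theoretical framework. It contains 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.

Evaluation - Question Generation

Model	ROUGE-L F1
t5 (original English dataset, baseline)	0.530
t5s-spanish-qg (Spanish machine-translated dataset)	0.445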
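
For readers unfamiliar with the metric, here is a minimal sketch of a sentence-level ROUGE-L F1 score based on the longest common subsequence (LCS). The card does not say which ROUGE implementation or tokenization produced the numbers above, so the lowercasing and whitespace tokenization here are assumptions, and this sketch is not expected to reproduce the reported scores exactly.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "¿A quién le gustaba pasear por el bosque?",
    "¿Quién paseaba por el bosque?",
)
print(round(score, 3))
```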

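🔧 Technical Details

The encoder receives the answer concatenated with the text, and the decoder generates the question. We use special tags to distinguish the components. The maximum input length is 512 tokens and the maximum output length is 128 tokens. Training runs for at most 20 epochs with early stopping (patience of 2) and a batch size of 16. At inference time, we use beam search with a beam width of 5.

📄 License

This fine-tuned model is released under the Apache-2.0 license.

Citation

Our paper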
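
The stopping rule described above (at most 20 epochs, patience of 2) can be sketched as follows. The card does not state which quantity is monitored, so a validation loss is assumed here, and `epochs_run` is a hypothetical name. Frameworks differ slightly in how they count patience, so treat this as an illustration rather than the authors' training loop.

```python
def epochs_run(val_losses, max_epochs: int = 20, patience: int = 2) -> int:
    """Return how many epochs complete before training stops.

    Assumption: training stops once the monitored validation loss has failed
    to improve for `patience` consecutive epochs, or after `max_epochs`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    completed = 0
    for loss in val_losses[:max_epochs]:
        completed += 1
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return completed


# Loss improves twice, then fails to improve for two epochs in a row.
print(epochs_run([1.0, 0.8, 0.9, 0.85, 0.7]))
```

In this illustrative run the fifth value is never reached, because the two non-improving epochs exhaust the patience first.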

@article{leite_fairytaleqa_translated_2024,
        title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages}, 
        author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
        year={2024},
        eprint={2406.04233},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
}

Original FairytaleQA paper

@inproceedings{xu-etal-2022-fantastic,
    title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
    author = "Xu, Ying  and
      Wang, Dakuo  and
      Yu, Mo  and
      Ritchie, Daniel  and
      Yao, Bingsheng  and
      Wu, Tongshuang  and
      Zhang, Zheng  and
      Li, Toby  and
      Bradford, Nora  and
      Sun, Branda  and
      Hoang, Tran  and
      Sang, Yisi  and
      Hou, Yufang  and
      Ma, Xiaojuan  and
      Yang, Diyi  and
      Peng, Nanyun  and
      Yu, Zhou  and
      Warschauer, Mark",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.34",
    doi = "10.18653/v1/2022.acl-long.34",
    pages = "447--460",
    abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}

T5S model paper

@inproceedings{araujo-etal-2024-sequence-sequence,
    title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
    author = "Araujo, Vladimir  and
      Trusca, Maria Mihaela  and
      Tufi{\~n}o, Rodrigo  and
      Moens, Marie-Francine",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1283",
    pages = "14729--14743",
    abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
}