T5S Spanish QG
A T5-based Spanish question generation model, fine-tuned on a Spanish machine-translated version of the FairytaleQA dataset
Downloads: 50
Released: 6/18/2024
Model Overview
This model specializes in generating education-oriented questions from Spanish text, supports comprehension of narrative elements, and is suited to K-8 educational settings.
Model Features
Educational domain optimization
Fine-tuned specifically for question generation over children's story content, making it well suited to educational applications
Narrative element understanding
Handles seven different types of narrative elements and relations
Multi-component input handling
Uses special tokens to distinguish the answer, text, and other input components
Model Capabilities
Spanish text understanding
Educational question generation
Narrative element analysis
Use Cases
Education technology
Reading comprehension support
Automatically generates comprehension questions for Spanish children's stories
Helps students improve their narrative understanding
Educational content development
Automatically generates teaching materials and assessment questions for teachers
Saves teachers preparation time and provides a variety of questions
🚀 t5s-spanish-qg Model Card
t5s-spanish-qg is a T5-based model, fine-tuned from T5S on a machine-translated Spanish version of the original English FairytaleQA dataset. The fine-tuning task is question generation. See our paper, accepted at ECTEL 2024.
✨ Main Features
- Based on the T5 architecture and fine-tuned on a Spanish dataset for the question generation task.
- Uses special tokens to distinguish the input components, structuring the input for question generation.
- Trained for at most 20 epochs with early stopping, for training efficiency.
- Uses beam search at inference to improve the quality of the generated questions.
📦 Installation
No specific installation steps are provided, so this section is skipped.
💻 Usage Examples
Basic Usage
>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qg")
>>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)
Important: the special tokens must be added and the model's token embeddings resized, since the added tokens enlarge the tokenizer vocabulary:
>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))
Advanced Usage
# Build the model input: the answer and the source text, each preceded by its special token
input_text = '<respuesta>' + 'Un Oso.' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'

# Encode the input (maximum input length is 512 tokens)
source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation='only_second',
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

# Generate the question with beam search (beam width 5)
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode the generated token ids back into text
preds = [
    tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]
generated_str = ''.join(preds)
print(generated_str)
Usage tip: see our repository for more code details.
📚 Documentation
Training Data
FairytaleQA is an open-source dataset designed to improve narrative comprehension, targeting students from kindergarten to eighth grade. It was carefully annotated by education experts based on an evidence-based theoretical framework and contains 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.
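For illustration, below is a minimal sketch of how a training pair could be assembled from one FairytaleQA-style record, mirroring the answer/text input format used in the usage example. The record field names (`answer`, `text`, `question`) and the sample question are assumptions for illustration, not the exact preprocessing from the original repository.

```python
# Hypothetical sketch: turn one FairytaleQA-style record into a (source, target)
# training pair using the special tokens this model expects. Field names are assumed.
def build_example(record: dict) -> tuple[str, str]:
    # Source: the answer span and the story text, each preceded by its tag
    source = '<respuesta>' + record["answer"] + '<texto>' + record["text"]
    # Target: the question the model learns to generate
    target = record["question"]
    return source, target

source, target = build_example({
    "answer": "Un Oso.",
    "text": "Érase una vez un oso al que le gustaba pasear por el bosque...",
    "question": "¿Quién paseaba por el bosque?",  # illustrative gold question
})
```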
Evaluation - Question Generation
| Model | ROUGE-L F1 |
| --- | --- |
| t5 (baseline, original English dataset) | 0.530 |
| t5s-spanish-qg (Spanish machine-translated dataset) | 0.445 |
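As a rough illustration of how a ROUGE-L F1 score like the one above can be computed, the snippet below uses the `evaluate` library on a toy prediction/reference pair; the paper's exact evaluation script, tokenization, and data splits may differ.

```python
# Sketch: compute ROUGE-L F1 for generated questions against reference questions.
# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
predictions = ["¿Quién paseaba por el bosque?"]              # model outputs (illustrative)
references = ["¿A quién le gustaba pasear por el bosque?"]   # gold questions (illustrative)

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])  # aggregated ROUGE-L F-measure
```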
🔧 Technical Details
The encoder takes the concatenation of the answer and the text, and the decoder generates the question; special tokens are used to distinguish the components. The maximum input length is set to 512 tokens and the maximum output length to 128 tokens. During training, the model runs for at most 20 epochs with early stopping (patience of 2) and a batch size of 16. At inference, beam search with a beam width of 5 is used.
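A minimal fine-tuning sketch reflecting these hyperparameters with the Hugging Face `Seq2SeqTrainer` is given below. It assumes the `model` and `tokenizer` from the Basic Usage section (with the special tokens added and embeddings resized) and pre-tokenized `train_dataset`/`eval_dataset` objects; it is a sketch under those assumptions, not the authors' exact training script.

```python
# Sketch: fine-tuning setup matching the reported hyperparameters
# (up to 20 epochs, early stopping with patience 2, batch size 16,
#  max output length 128, beam width 5 for generation during evaluation).
# Assumes `model`, `tokenizer`, `train_dataset`, `eval_dataset` already exist.
from transformers import (
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="t5s-spanish-qg-ft",
    num_train_epochs=20,                 # maximum number of epochs
    per_device_train_batch_size=16,      # batch size 16
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",         # evaluate each epoch for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,         # required by EarlyStoppingCallback
    predict_with_generate=True,
    generation_max_length=128,           # maximum output tokens
    generation_num_beams=5,              # beam search with beam width 5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```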
📄 License
This fine-tuned model is released under the Apache-2.0 License.
Citation Information
This Project's Paper
@article{leite_fairytaleqa_translated_2024,
title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
year={2024},
eprint={2406.04233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Original FairytaleQA Paper
@inproceedings{xu-etal-2022-fantastic,
title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
author = "Xu, Ying and
Wang, Dakuo and
Yu, Mo and
Ritchie, Daniel and
Yao, Bingsheng and
Wu, Tongshuang and
Zhang, Zheng and
Li, Toby and
Bradford, Nora and
Sun, Branda and
Hoang, Tran and
Sang, Yisi and
Hou, Yufang and
Ma, Xiaojuan and
Yang, Diyi and
Peng, Nanyun and
Yu, Zhou and
Warschauer, Mark",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.34",
doi = "10.18653/v1/2022.acl-long.34",
pages = "447--460",
abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}
T5S Model Paper
@inproceedings{araujo-etal-2024-sequence-sequence,
title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
author = "Araujo, Vladimir and
Trusca, Maria Mihaela and
Tufi{\~n}o, Rodrigo and
Moens, Marie-Francine",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1283",
pages = "14729--14743",
abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
}