T5s Spanish Qg
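A T5-based Spanish question generation model, fine-tuned on a machine-translated Spanish version of the FairytaleQA dataset
Downloads: 50
Release date: 6/18/2024
Model overview
The model generates educational questions from Spanish text, supports the understanding of narrative elements, and targets K-8 educational settings
Model features
Optimized for education
Fine-tuned for question generation on children's stories, suitable for educational applications
Narrative element understanding
Handles seven types of narrative elements and relations
Multi-component input handling
Uses special tags to distinguish input components such as the answer and the text
Model capabilities
Spanish text understanding
Educational question generation
Narrative element analysis
Use cases
Educational technology
Reading comprehension support
Automatically generates comprehension questions for Spanish children's stories
Helps students improve their narrative comprehension
Educational content development
Automatically generates teaching materials and assessment questions for teachers
Saves teachers preparation time and provides varied questions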
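🚀 t5s-spanish-qg model card
t5s-spanish-qg is a T5-based model fine-tuned from T5S on a Spanish machine-translated version of the original English FairytaleQA dataset. The fine-tuning task is question generation. See our paper, accepted at ECTEL 2024.
✨ Main features
- Based on the T5 architecture and fine-tuned on a Spanish dataset for question generation.
- Uses special tags to distinguish the input components (answer and text).
- Trained for at most 20 epochs with early stopping, which avoids unnecessary training epochs.
- Uses beam search at inference time to improve the quality of the generated questions.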
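To illustrate why beam search can produce better output than greedy decoding, here is a toy sketch. It is not the model's actual decoding code: the probability table `NEXT_TOKEN_PROBS` and the function `beam_search` are made up for illustration, and real `generate` calls also apply settings such as length penalties that are omitted here.

```python
import math
from typing import Dict, List, Tuple

# Made-up next-token probabilities, keyed by the previous token (illustration only).
NEXT_TOKEN_PROBS: Dict[str, Dict[str, float]] = {
    "<s>": {"¿Qué": 0.6, "¿Quién": 0.4},
    "¿Qué": {"hizo": 0.55, "vio": 0.45},
    "¿Quién": {"paseaba": 0.9, "vio": 0.1},
    "hizo": {"</s>": 1.0},
    "vio": {"</s>": 1.0},
    "paseaba": {"</s>": 1.0},
}


def beam_search(num_beams: int, max_steps: int = 5) -> List[Tuple[List[str], float]]:
    """Return the final beams as (tokens, log-probability), best first."""
    beams: List[Tuple[List[str], float]] = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[List[str], float]] = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                # Finished hypotheses are carried over unchanged.
                candidates.append((tokens, score))
                continue
            for tok, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the num_beams highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams


greedy_tokens, greedy_score = beam_search(num_beams=1)[0]
beam_tokens, beam_score = beam_search(num_beams=5)[0]
print(" ".join(greedy_tokens), round(math.exp(greedy_score), 2))
print(" ".join(beam_tokens), round(math.exp(beam_score), 2))
```

With one beam the search commits to the locally best first token, while a wider beam keeps alternatives alive and can end up with a more probable full sequence. The value `num_beams=5` mirrors the beam width reported for this model.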
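📦 Installation
The source gives no specific installation steps. The usage examples rely on the Hugging Face transformers library with PyTorch (tensors are requested with return_tensors='pt'), and T5Tokenizer additionally requires the sentencepiece package.
💻 Usage examples
Basic usage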
>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qg")
>>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)
Important: the special tokens must be added to the tokenizer and the model's token embeddings resized to match:
>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))
Advanced usage
# Build the input: answer and story text, each introduced by its special tag
input_text = '<respuesta>' + 'Un Oso.' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'

# Tokenize, padding or truncating to 512 tokens
source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation=True,  # single-sequence input: truncate it to max_length
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

# Generate one question with beam search (beam width 5)
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode the generated token ids back into text
preds = [
    tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]
generated_str = ''.join(preds)
print(generated_str)
Tip: see our repository for more code details.
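The input format used in the advanced example (answer and text joined by special tags) can be wrapped in a small helper. This is a minimal sketch: the function name `build_qg_input` is hypothetical and not taken from the authors' repository; only the tags `<respuesta>` and `<texto>` and their order come from the example above.

```python
# Hypothetical helper (not from the authors' repository). It reproduces the
# layout used in the usage example: '<respuesta>' + answer + '<texto>' + text.
ANSWER_TAG = "<respuesta>"
TEXT_TAG = "<texto>"


def build_qg_input(answer: str, text: str) -> str:
    """Concatenate the answer and the story text using the special tags."""
    # Strip surrounding whitespace so each tag sits directly next to its
    # content, as in the original example.
    return f"{ANSWER_TAG}{answer.strip()}{TEXT_TAG}{text.strip()}"


example = build_qg_input(
    "Un Oso.",
    "Érase una vez un oso al que le gustaba pasear por el bosque...",
)
print(example)
```

The resulting string can be passed to the tokenizer in place of `input_text`.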
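📚 Documentation
Training data
FairytaleQA is an open-source dataset designed to improve narrative comprehension, aimed at students from kindergarten to eighth grade. It was annotated by education experts following an evidence-based theoretical framework. It contains 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.
Evaluation - Question generation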
| Model | ROUGE-L F1 |
|---|---|
| t5 (original English dataset, baseline) | 0.530 |
| t5s-spanish-qg (Spanish machine-translated dataset) | 0.445 |
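For readers unfamiliar with the metric, ROUGE-L F1 is based on the longest common subsequence (LCS) between a generated question and a reference question. The sketch below is an illustrative implementation with simple lowercase whitespace tokenization. It is not the evaluation script used to produce the scores above, and the function names and example sentences are made up.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # Dynamic programming over a rolling row to keep memory small.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 from LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example pair, for illustration only.
score = rouge_l_f1("¿Qué le gustaba al oso?", "¿Qué le gustaba hacer al oso?")
print(round(score, 3))
```

Here the candidate shares 5 of its 5 tokens (precision 1.0) with the 6-token reference (recall 5/6), which gives an F1 of 10/11.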
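🔧 Technical details
The encoder receives the answer and the text concatenated, and the decoder generates the question. We use special tags to separate the components. The maximum input length is 512 tokens and the maximum output length is 128 tokens. The model is trained for at most 20 epochs with early stopping and a patience of 2. The batch size is 16. At inference time, we use beam search with a beam width of 5.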
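The training schedule described above (at most 20 epochs, early stopping with a patience of 2) can be sketched as follows. This is an assumption-laden illustration, not the authors' training loop: the monitored quantity is assumed to be a validation loss, `run_epoch` is a hypothetical callback, and the loss values are invented.

```python
from typing import Callable, List, Tuple

MAX_EPOCHS = 20  # from the text
PATIENCE = 2     # from the text


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = MAX_EPOCHS,
    patience: int = PATIENCE,
) -> Tuple[int, float]:
    """Train until the validation loss stops improving.

    `run_epoch(epoch)` is a hypothetical callback that trains for one epoch
    and returns the validation loss. Returns (best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss


# Invented validation losses, for illustration only.
fake_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.60]
epochs_run: List[int] = []


def fake_epoch(epoch: int) -> float:
    epochs_run.append(epoch)
    return fake_losses[epoch]


best_epoch, best_loss = train_with_early_stopping(fake_epoch)
print(best_epoch, best_loss, len(epochs_run))
```

In this run, training stops after the fifth epoch because epochs 3 and 4 fail to beat the loss reached at epoch 2, so the later improvement in the invented list is never seen. That is the usual trade-off of a small patience value.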
📄 License
This fine-tuned model is released under the Apache-2.0 license.
Citation
Our paper
@article{leite_fairytaleqa_translated_2024,
title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
year={2024},
eprint={2406.04233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Original FairytaleQA paper
@inproceedings{xu-etal-2022-fantastic,
title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
author = "Xu, Ying and
Wang, Dakuo and
Yu, Mo and
Ritchie, Daniel and
Yao, Bingsheng and
Wu, Tongshuang and
Zhang, Zheng and
Li, Toby and
Bradford, Nora and
Sun, Branda and
Hoang, Tran and
Sang, Yisi and
Hou, Yufang and
Ma, Xiaojuan and
Yang, Diyi and
Peng, Nanyun and
Yu, Zhou and
Warschauer, Mark",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.34",
doi = "10.18653/v1/2022.acl-long.34",
pages = "447--460",
abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}
T5S model paper
@inproceedings{araujo-etal-2024-sequence-sequence,
title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
author = "Araujo, Vladimir and
Trusca, Maria Mihaela and
Tufi{\~n}o, Rodrigo and
Moens, Marie-Francine",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1283",
pages = "14729--14743",
abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
}