T5S Spanish QG
A T5-based Spanish question generation model, fine-tuned on a Spanish machine-translated version of the FairytaleQA dataset
Downloads: 50
Released: 6/18/2024
Model Overview
This model specializes in generating education-oriented questions from Spanish text, supports comprehension of narrative elements, and is suited to K-8 educational settings.
Model Features
Educational domain optimization
Fine-tuned specifically for question generation over children's story content, making it well suited to educational applications
Narrative element understanding
Handles seven different types of narrative elements and relations
Multi-component input handling
Uses special tokens to distinguish the answer, text, and other input components
Model Capabilities
Spanish text understanding
Educational question generation
Narrative element analysis
Use Cases
Education technology
Reading comprehension support
Automatically generates comprehension questions for Spanish children's stories
Helps students improve their narrative understanding
Educational content development
Automatically generates teaching materials and assessment questions for teachers
Saves teachers preparation time and provides a variety of questions
🚀 t5s-spanish-qg Model Card
t5s-spanish-qg is a T5-based model, fine-tuned from T5S on a machine-translated Spanish version of the original English FairytaleQA dataset. The fine-tuning task is question generation. See our paper, accepted at ECTEL 2024.
✨ Main Features
- Based on the T5 architecture and fine-tuned on a Spanish dataset for the question generation task.
- Uses special tokens to distinguish the input components, structuring the input for question generation.
- Trained for at most 20 epochs with early stopping, for training efficiency.
- Uses beam search at inference to improve the quality of the generated questions.
📦 Installation
No specific installation steps are provided, so this section is skipped.
💻 Usage Examples
Basic Usage
>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qg")
>>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)
Important: the special tokens must be added and the model's token embeddings resized, since the added tokens enlarge the tokenizer vocabulary:
>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))
Advanced Usage
# Build the model input: the answer and the source text, each preceded by its special token
input_text = '<respuesta>' + 'Un Oso.' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'

# Encode the input (maximum input length is 512 tokens)
source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation='only_second',
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

# Generate the question with beam search (beam width 5)
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode the generated token ids back into text
preds = [
    tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]
generated_str = ''.join(preds)
print(generated_str)
Usage tip: see our repository for more code details.
📚 Documentation
Training Data
FairytaleQA is an open-source dataset designed to improve narrative comprehension, targeting students from kindergarten to eighth grade. It was carefully annotated by education experts based on an evidence-based theoretical framework and contains 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.
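For illustration, below is a minimal sketch of how a training pair could be assembled from one FairytaleQA-style record, mirroring the answer/text input format used in the usage example. The record field names (`answer`, `text`, `question`) and the sample question are assumptions for illustration, not the exact preprocessing from the original repository.

```python
# Hypothetical sketch: turn one FairytaleQA-style record into a (source, target)
# training pair using the special tokens this model expects. Field names are assumed.
def build_example(record: dict) -> tuple[str, str]:
    # Source: the answer span and the story text, each preceded by its tag
    source = '<respuesta>' + record["answer"] + '<texto>' + record["text"]
    # Target: the question the model learns to generate
    target = record["question"]
    return source, target

source, target = build_example({
    "answer": "Un Oso.",
    "text": "Érase una vez un oso al que le gustaba pasear por el bosque...",
    "question": "¿Quién paseaba por el bosque?",  # illustrative gold question
})
```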
Evaluation - Question Generation
| Model | ROUGE-L F1 |
| --- | --- |
| t5 (baseline, original English dataset) | 0.530 |
| t5s-spanish-qg (Spanish machine-translated dataset) | 0.445 |
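As a rough illustration of how a ROUGE-L F1 score like the one above can be computed, the snippet below uses the `evaluate` library on a toy prediction/reference pair; the paper's exact evaluation script, tokenization, and data splits may differ.

```python
# Sketch: compute ROUGE-L F1 for generated questions against reference questions.
# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
predictions = ["¿Quién paseaba por el bosque?"]              # model outputs (illustrative)
references = ["¿A quién le gustaba pasear por el bosque?"]   # gold questions (illustrative)

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])  # aggregated ROUGE-L F-measure
```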
🔧 Technical Details
The encoder takes the concatenation of the answer and the text, and the decoder generates the question; special tokens are used to distinguish the components. The maximum input length is set to 512 tokens and the maximum output length to 128 tokens. During training, the model runs for at most 20 epochs with early stopping (patience of 2) and a batch size of 16. At inference, beam search with a beam width of 5 is used.
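A minimal fine-tuning sketch reflecting these hyperparameters with the Hugging Face `Seq2SeqTrainer` is given below. It assumes the `model` and `tokenizer` from the Basic Usage section (with the special tokens added and embeddings resized) and pre-tokenized `train_dataset`/`eval_dataset` objects; it is a sketch under those assumptions, not the authors' exact training script.

```python
# Sketch: fine-tuning setup matching the reported hyperparameters
# (up to 20 epochs, early stopping with patience 2, batch size 16,
#  max output length 128, beam width 5 for generation during evaluation).
# Assumes `model`, `tokenizer`, `train_dataset`, `eval_dataset` already exist.
from transformers import (
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="t5s-spanish-qg-ft",
    num_train_epochs=20,                 # maximum number of epochs
    per_device_train_batch_size=16,      # batch size 16
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",         # evaluate each epoch for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,         # required by EarlyStoppingCallback
    predict_with_generate=True,
    generation_max_length=128,           # maximum output tokens
    generation_num_beams=5,              # beam search with beam width 5
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```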
📄 License
This fine-tuned model is released under the Apache-2.0 License.
Citation Information
This Project's Paper
@article{leite_fairytaleqa_translated_2024,
title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
year={2024},
eprint={2406.04233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Original FairytaleQA Paper
@inproceedings{xu-etal-2022-fantastic,
title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
author = "Xu, Ying and
Wang, Dakuo and
Yu, Mo and
Ritchie, Daniel and
Yao, Bingsheng and
Wu, Tongshuang and
Zhang, Zheng and
Li, Toby and
Bradford, Nora and
Sun, Branda and
Hoang, Tran and
Sang, Yisi and
Hou, Yufang and
Ma, Xiaojuan and
Yang, Diyi and
Peng, Nanyun and
Yu, Zhou and
Warschauer, Mark",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.34",
doi = "10.18653/v1/2022.acl-long.34",
pages = "447--460",
abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}
T5S Model Paper
@inproceedings{araujo-etal-2024-sequence-sequence,
title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
author = "Araujo, Vladimir and
Trusca, Maria Mihaela and
Tufi{\~n}o, Rodrigo and
Moens, Marie-Francine",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1283",
pages = "14729--14743",
abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
}