t5s-spanish-qg: an open-source Spanish question generation (QG) model for generating Spanish questions at no cost and supporting Spanish-learning apps

T5s Spanish Qg

Developed by benjleite

A T5-based Spanish question generation model, fine-tuned on a machine-translated Spanish version of the FairytaleQA dataset.

Question Answering System

Transformers

Spanish. Open-source license: Apache-2.0 #Spanish question generation #Educational narrative comprehension #T5 fine-tuned model

Downloads: 50

Release date: 6/18/2024

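Model Overview

This model is designed specifically to generate educational questions from Spanish text. It supports comprehension of narrative elements and suits kindergarten through eighth-grade (K-8) educational settings.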

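Model Features

Optimized for education

Fine-tuned specifically to generate questions about children's stories, which makes it suitable for educational applications.

Narrative element comprehension

Handles seven types of narrative elements and relations.

Multi-component input handling

Uses special tokens to distinguish input components such as the answer and the text.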

Model Capabilities

Spanish text comprehension

Educational question generation

Narrative element analysis

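Use Cases

Educational technology

Reading comprehension support

Automatically generates comprehension questions for Spanish children's stories.

Helps students improve their narrative comprehension.

Educational content development

Automatically generates teaching materials and assessment questions for teachers.

Saves teachers preparation time and provides a variety of questions.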

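🚀 t5s-spanish-qg Model Card

t5s-spanish-qg is a T5-based model. It fine-tunes T5S on a machine-translated Spanish version of the original English FairytaleQA dataset. The fine-tuning task is question generation. See our paper, accepted at ECTEL 2024.

✨ Key Features

A T5-based model specialized for Spanish question generation.
Fine-tuned on the machine-translated Spanish version of the FairytaleQA dataset.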

📦 Installation

Load the model and tokenizer

>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qg")
>>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)

Important: you must add the special tokens and resize the model's token embeddings.

>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))
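The reason for the resize step can be illustrated with a toy, library-free sketch. The `add_tokens` and `resize_embeddings` functions below are hypothetical stand-ins written for this illustration, not the Transformers implementations, and the zero-filled rows are only placeholders (the real library chooses its own initialization). The point is that every token id needs a matching embedding row, so growing the vocabulary by six tokens means growing the embedding table by six rows.

```python
# Toy illustration only: these helpers are hypothetical and do not call Transformers.
SPECIAL_TOKENS = ['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>']


def add_tokens(vocab: dict, new_tokens: list) -> int:
    """Assign the next free id to each unseen token; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            added += 1
    return added


def resize_embeddings(embeddings: list, new_size: int, dim: int) -> list:
    """Return a copy of the table with exactly new_size rows (zero rows as placeholders)."""
    rows = [row[:] for row in embeddings[:new_size]]
    while len(rows) < new_size:
        rows.append([0.0] * dim)
    return rows


# A made-up four-token vocabulary with two-dimensional embeddings.
vocab = {"<pad>": 0, "</s>": 1, "oso": 2, "bosque": 3}
embeddings = [[0.1, 0.2] for _ in vocab]

added = add_tokens(vocab, SPECIAL_TOKENS)
embeddings = resize_embeddings(embeddings, len(vocab), dim=2)

print(added, len(vocab), len(embeddings))
```

Without the resize, the ids assigned to the new tokens would index past the end of the embedding table.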

💻 Usage Example

Basic usage

input_text = '<respuesta>' + 'Un Oso.' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'

source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation=True,  # 'only_second' applies to sequence pairs; this input is a single string
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
    
input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode every generated sequence; skip_special_tokens=True drops padding and end-of-sequence markers.
preds = [
    tokenizer.decode(generated_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]

generated_str = ''.join(preds)

print(generated_str)

Note: for further code details, see our repository.
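When preparing several answer and passage pairs, the input format from the usage example can be wrapped in a small helper. This is a minimal sketch: `build_qg_input` is a hypothetical name, and stripping surrounding whitespace is an assumption added here, not something the model card specifies. The token order (answer first, then text) follows the usage example.

```python
# Special tokens taken from the add_tokens call in the model card.
ANSWER_TOKEN = "<respuesta>"
TEXT_TOKEN = "<texto>"


def build_qg_input(answer: str, text: str) -> str:
    """Concatenate answer and passage in the order used by the usage example.

    Stripping whitespace is an assumption of this sketch.
    """
    return f"{ANSWER_TOKEN}{answer.strip()}{TEXT_TOKEN}{text.strip()}"


pairs = [
    ("Un Oso.", "Érase una vez un oso al que le gustaba pasear por el bosque..."),
]
inputs = [build_qg_input(answer, text) for answer, text in pairs]
print(inputs[0])
```

Each resulting string can then be passed to the tokenizer in place of `input_text`.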

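📚 Documentation

Model description

t5s-spanish-qg is a T5-based model specialized for Spanish question generation. It is fine-tuned on a machine-translated Spanish version of the original English FairytaleQA dataset.

Training data

FairytaleQA is an open-source dataset designed to improve narrative comprehension, aimed at students from kindergarten through eighth grade. Educational experts annotated it carefully on the basis of an evidence-based theoretical framework. It consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations.

Implementation details

The encoder receives the answer and the text concatenated, and the decoder generates the question. We use special labels to distinguish the components. The maximum input length is 512 tokens and the maximum output length is 128 tokens. The model is trained for up to 20 epochs, with early stopping at a patience of 2. The batch size is 16. At inference time, we use beam search with a beam width of 5.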
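The training schedule described above (at most 20 epochs, early stopping with patience 2) can be sketched as follows. This is not the authors' training code: the `EarlyStopping` class and `epochs_run` helper are hypothetical, and monitoring validation loss is an assumption, since the model card does not say which metric is tracked.

```python
class EarlyStopping:
    """Stop once the monitored value has failed to improve for `patience` epochs in a row."""

    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, max_epochs: int = 20, patience: int = 2) -> int:
    """Return how many epochs would run for a given sequence of validation losses."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)


# Made-up losses: the best value is at epoch 2, followed by two epochs without improvement.
print(epochs_run([1.0, 0.8, 0.9, 0.85, 0.7]))
```

With these made-up losses the run ends before the later improvement at epoch 5 is ever seen, which is the usual trade-off of a small patience value.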

Evaluation - Question Generation

Model	ROUGE-L F1
t5 (for the original English dataset, baseline)	0.530
t5s-spanish-qg (for the Spanish machine-translated dataset)	0.445
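For readers unfamiliar with the metric, ROUGE-L F1 is the harmonic mean of precision and recall computed over the longest common subsequence (LCS) of reference and generated tokens. The sketch below assumes lowercased whitespace tokenization with no stemming. The scores in the table were likely produced with a different tokenizer and tooling, so this code is not expected to reproduce them.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (a simplifying assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up question pair for illustration; not taken from the dataset.
score = rouge_l_f1("¿Qué le gustaba al oso?", "¿Qué le gustaba hacer al oso?")
print(round(score, 3))
```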

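🔧 Technical Details

The encoder receives the answer and the text concatenated, and the decoder generates the question.
Special labels distinguish the components.
The maximum input length is 512 tokens and the maximum output length is 128 tokens.
Training runs for up to 20 epochs, with early stopping at a patience of 2.
The batch size is 16.
At inference time, beam search with a beam width of 5 is used.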
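The settings listed above can be gathered in one place, for instance in a small dataclass. `QGSettings` and its method names are hypothetical; only the numeric values come from the model card. The keyword names (`max_length`, `truncation`, `return_tensors`, `num_beams`, `early_stopping`) are the standard Transformers tokenizer and `generate` arguments. This sketch caps generation at the stated 128 output tokens, whereas the usage example passes `max_length=512` to `generate`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QGSettings:
    """Hyperparameters stated in the model card, collected for reuse."""

    max_input_tokens: int = 512
    max_output_tokens: int = 128
    max_epochs: int = 20
    early_stopping_patience: int = 2
    batch_size: int = 16
    num_beams: int = 5

    def tokenizer_kwargs(self) -> dict:
        """Keyword arguments for encoding the input string."""
        return {
            "max_length": self.max_input_tokens,
            "truncation": True,
            "return_tensors": "pt",
        }

    def generate_kwargs(self) -> dict:
        """Keyword arguments for beam-search decoding."""
        return {
            "max_length": self.max_output_tokens,
            "num_beams": self.num_beams,
            "early_stopping": True,
        }


settings = QGSettings()
print(settings.generate_kwargs())
```

The dictionaries can be unpacked into the calls from the usage example, for instance `model.generate(input_ids=input_ids, attention_mask=attention_mask, **settings.generate_kwargs())`.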

📄 License

This fine-tuned model is released under the Apache-2.0 license.

Citation

Our paper (preprint, accepted at ECTEL 2024)

@article{leite_fairytaleqa_translated_2024,
        title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages}, 
        author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
        year={2024},
        eprint={2406.04233},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
}

Original FairytaleQA paper

@inproceedings{xu-etal-2022-fantastic,
    title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
    author = "Xu, Ying  and
      Wang, Dakuo  and
      Yu, Mo  and
      Ritchie, Daniel  and
      Yao, Bingsheng  and
      Wu, Tongshuang  and
      Zhang, Zheng  and
      Li, Toby  and
      Bradford, Nora  and
      Sun, Branda  and
      Hoang, Tran  and
      Sang, Yisi  and
      Hou, Yufang  and
      Ma, Xiaojuan  and
      Yang, Diyi  and
      Peng, Nanyun  and
      Yu, Zhou  and
      Warschauer, Mark",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.34",
    doi = "10.18653/v1/2022.acl-long.34",
    pages = "447--460",
    abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}

T5S model

@inproceedings{araujo-etal-2024-sequence-sequence,
    title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
    author = "Araujo, Vladimir  and
      Trusca, Maria Mihaela  and
      Tufi{\~n}o, Rodrigo  and
      Moens, Marie-Francine",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1283",
    pages = "14729--14743",
    abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
}