persian_xlm_roberta_large: an open-source QA model developed specifically for Persian question answering

Persian Xlm Roberta Large

Developed by pedramyazdipoor

A QA model based on the multilingual pretrained XLM-RoBERTa model and fine-tuned on the Persian QA dataset PQuAD

Question Answering

Transformers

#PersianQA #MultilingualPretraining #HighAccuracyQA

Downloads: 77

Release date: 9/18/2022

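Model Overview

This is an XLM-RoBERTa large model fine-tuned on the PQuAD dataset for Persian question answering.

Model Features

Multilingual pretrained base

Built on the XLM-RoBERTa large model, which supports 100 languages

Optimized for Persian

Fine-tuned specifically on PQuAD, the largest Persian QA dataset

Efficient training

Uses techniques such as gradient accumulation to complete training with limited GPU resources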

Model Capabilities

Persian question answering

Cross-lingual transfer learning

Text understanding

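Use Cases

Education

Persian learning support

Helps learners understand Persian texts through question answering

Exact match 66.56%, F1 score 87.31% on the PQuAD test set

Information retrieval

Persian document QA system

Extracts answers from Persian documents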

🚀 Persian XLM-RoBERTa Large for Question Answering

XLM-RoBERTa is a multilingual language model pretrained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.

The multilingual [XLM-RoBERTa large for QA on various languages](https://huggingface.co/deepset/xlm-roberta-large-squad2) is fine-tuned on a range of QA datasets, but not on PQuAD, the largest Persian QA dataset so far. This second model is the base model that we fine-tune.

Paper introducing the PQuAD dataset: arXiv:2202.06219

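🚀 Quick Start

This model is fine-tuned on the PQuAD training set and is ready to use. Because training took a very long time, we decided to publish the model for anyone who needs it.

✨ Key Features

A fine-tuned model specialized for Persian question answering.
Outperforms ParsBert when evaluated on the PQuAD dataset.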

🔧 Technical Details

Training hyperparameters

Because of GPU memory limits on Google Colab, the batch size was set to 4.

batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
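To make the interaction between the small batch size and gradient accumulation concrete, here is a minimal sketch of the arithmetic. The training-set size `num_train_features` is a hypothetical round number chosen only for illustration, and mapping `batch_size` to `per_device_train_batch_size` is an assumption about how the values were passed to `transformers.TrainingArguments`; the source does not show the training script.

```python
# Values copied from the hyperparameter list above.
batch_size = 4
gradient_accumulation_steps = 8
n_epochs = 1
warmup_ratio = 0.1

# Hypothetical training-set size, used only to make the arithmetic concrete.
num_train_features = 64_000

# Gradients from 8 mini-batches of 4 are summed before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps
# Rough number of optimizer steps; the Trainer's own count may differ slightly.
optimizer_steps = (num_train_features // effective_batch_size) * n_epochs
warmup_steps = int(optimizer_steps * warmup_ratio)

# Keyword arguments in the shape accepted by transformers.TrainingArguments
# (assumed mapping; newer transformers releases name the first strategy
# argument eval_strategy instead of evaluation_strategy).
training_kwargs = dict(
    per_device_train_batch_size=batch_size,
    num_train_epochs=n_epochs,
    learning_rate=3e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    warmup_ratio=warmup_ratio,
    gradient_accumulation_steps=gradient_accumulation_steps,
    weight_decay=0.01,
)

print(effective_batch_size, optimizer_steps, warmup_steps)
```

Under these assumptions each optimizer step sees 32 examples, even though only 4 are held in GPU memory at a time.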

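Performance evaluation

The model was evaluated on the official PQuAD Persian test set. We found that training for more than one epoch gave worse results. Our XLM-Roberta outperforms our ParsBert on PQuAD, but the former is more than three times larger than the latter, so comparing the two is not fair.

Question answering on the PQuAD test set

Metric	Our XLM-Roberta Large	Our ParsBert
Exact Match	66.56*	47.44
F1	87.31*	81.96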
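The gap between exact match and F1 in the table is easier to read with a small example. The sketch below uses SQuAD-style definitions with plain whitespace tokenization; the function names are illustrative, and the official PQuAD evaluation script may normalize text differently, so treat this as an approximation of the metrics rather than the scoring code behind the numbers above.

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the predicted answer equals the gold answer, else 0.0."""
    return float(prediction.strip() == gold.strip())

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction that finds the right number but drops one word of the gold span.
prediction, gold = "26", "26 سال"
print(exact_match(prediction, gold))
print(round(token_f1(prediction, gold), 4))
```

A partially correct span scores zero on exact match but still earns partial F1 credit, which is why F1 is usually the higher of the two numbers.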

💻 Usage Examples

Basic usage

PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
path = 'pedramyazdipoor/persian_xlm_roberta_large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)
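As an alternative to loading the tokenizer and model by hand, the generic `question-answering` pipeline from `transformers` can wrap the same checkpoint. This is a sketch of a different route, not the author's inference procedure: the pipeline applies its own span-selection logic, so its answers are not guaranteed to match the custom inference code on this page. The helper name `build_qa_pipeline` is hypothetical.

```python
def build_qa_pipeline(model_path="pedramyazdipoor/persian_xlm_roberta_large"):
    """Return a Hugging Face question-answering pipeline for the checkpoint."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline
    return pipeline("question-answering", model=model_path, tokenizer=model_path)

# Example call (downloads the checkpoint on first use):
# qa = build_qa_pipeline()
# result = qa(question="چند سالمه؟", context="سلام من پدرامم 26 سالمه")
# print(result["answer"], result["score"])
```

The pipeline returns a dictionary with `answer`, `score`, `start` and `end` keys, where `start` and `end` are character positions in the context.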

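Advanced usage

Inference

There are a few points to keep in mind at inference time.

The start index of the answer must not exceed the end index (the two are equal for a single-token answer).
The answer span must lie within the context.
The selected span must be the most probable choice among the N pairs of candidates.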
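The first and third points can be illustrated on toy scores. The sketch below, with made-up logits and illustrative helper names, picks the best valid span twice: once by summing raw logits and once by multiplying softmax probabilities. Because softmax is monotone within each distribution, both rankings agree, which is why span selection can work on logits directly.

```python
import math
from itertools import product

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_scores, end_scores, combine):
    """Return the (start, end) pair with start <= end that maximizes combine()."""
    candidates = [
        (i, j)
        for i, j in product(range(len(start_scores)), range(len(end_scores)))
        if i <= j  # the start may not come after the end
    ]
    return max(candidates, key=lambda ij: combine(start_scores[ij[0]], end_scores[ij[1]]))

# Made-up logits: the best start (index 2) comes after the best end (index 1),
# so the unconstrained argmax pair would be an invalid span.
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [0.0, 2.5, 1.0, 2.0]

by_logit_sum = best_span(start_logits, end_logits, lambda s, e: s + e)
by_probability = best_span(softmax(start_logits), softmax(end_logits), lambda s, e: s * e)
print(by_logit_sum, by_probability)
```

Unlike this exhaustive toy version, a top-N search only examines the N highest-scoring starts and ends, which keeps the cost at N × N comparisons regardless of sequence length.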

import numpy as np

def generate_indexes(start_logits, end_logits, N, min_index):
  # Map each token index to its start / end score (raw logits, not probabilities).
  start_scores = start_logits.tolist()
  end_scores = end_logits.tolist()
  list_start = dict(zip(np.arange(len(start_scores)), start_scores))
  list_end = dict(zip(np.arange(len(end_scores)), end_scores))

  # Sort the candidates by score, highest first.
  sorted_start_list = sorted(list_start.items(), key=lambda x: x[1], reverse=True)
  sorted_end_list = sorted(list_end.items(), key=lambda x: x[1], reverse=True)

  # Start from index 0 as the fallback, then search the top-N starts and top-N ends
  # for a valid pair with a higher combined score.
  start_idx, end_idx, prob = 0, 0, (start_scores[0] + end_scores[0])
  for a in range(0, N):
    for b in range(0, N):
      if (sorted_start_list[a][1] + sorted_end_list[b][1]) > prob:
        # A valid span starts no later than it ends and begins after min_index.
        if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_start_list[a][0] > min_index):
          prob = sorted_start_list[a][1] + sorted_end_list[b][1]
          start_idx = sorted_start_list[a][0]
          end_idx = sorted_end_list[b][0]

  return start_idx, end_idx

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)
text = 'سلام من پدرامم 26 سالمه'
question = 'چند سالمه؟'
encoding = tokenizer(question, text, add_special_tokens = True,
                     return_token_type_ids = True,
                     return_tensors = 'pt',
                     padding = True,
                     return_offsets_mapping = True,
                     truncation = 'only_first',
                     max_length = 32)
# No gradients are needed for inference.
with torch.no_grad():
  out = model(encoding['input_ids'].to(device), encoding['attention_mask'].to(device), encoding['token_type_ids'].to(device))
# We had to change some pieces of code to make it compatible with generating one answer at a time.
# If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:], because <s> (the first token) represents this situation and must be compared with the other tokens.
# You can set min_index in generate_indexes() to force the chosen tokens to lie within the context (the start index must be greater than the separator token's index).
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, 0)
# The logits were sliced with [1:], so drop <s> from the encoded tokens as well to keep the indexes aligned.
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0].tolist())[1:]
print(tokens)
print(tokens[answer_start_index : (answer_end_index + 1)])
# Answer span reported by the author for this example: ['▁26']
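The tokenizer call above requests `return_offsets_mapping = True`, but the example prints subword tokens rather than a substring of the original text. A minimal sketch of how character offsets could turn a token span into a clean answer string follows. The `offsets` list here is hand-written for an English sentence and stands in for the context rows of `encoding['offset_mapping']`; the helper name `span_to_text` is hypothetical. In a real question-plus-context encoding, the token indexes would first need to be shifted to account for `<s>`, the question tokens and the separator tokens.

```python
def span_to_text(context, offsets, start_token, end_token):
    """Return the substring of context covered by tokens start_token..end_token (inclusive)."""
    start_char = offsets[start_token][0]
    end_char = offsets[end_token][1]
    return context[start_char:end_char]

context = "I am Pedram and I am 26 years old"
# Hand-written (start_char, end_char) pairs, one per whitespace-separated word.
offsets = [(0, 1), (2, 4), (5, 11), (12, 15), (16, 17),
           (18, 20), (21, 23), (24, 29), (30, 33)]

print(span_to_text(context, offsets, 6, 6))  # single-token answer
print(span_to_text(context, offsets, 6, 7))  # two-token answer
```

Slicing the original string avoids the `▁` subword markers and preserves the exact spacing of the source text.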