🚀 Persian XLM-RoBERTA Large For Question Answering Task
This model is based on XLM-RoBERTa, a multilingual language model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages and introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al. Our starting point for fine-tuning is deepset/xlm-roberta-large-squad2, a multilingual XLM-RoBERTa large model already fine-tuned for QA on several datasets, none of which is PQuAD, the largest Persian QA dataset to date.
Paper presenting the PQuAD dataset: arXiv:2202.06219
🚀 Quick Start
This model is fine-tuned on the PQuAD train set and is ready to use. Training takes a long time, so I am publishing this model to save others the effort.
✨ Features
- Utilizes the powerful XLM-RoBERTa large model.
- Fine-tuned on the PQuAD dataset for Persian question answering.
📦 Installation
The model loads through the Hugging Face transformers library; the examples below also use torch and numpy.
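A minimal setup, assuming a standard Python environment (the package list is inferred from the imports used below and is not stated in the original):
pip install transformers torch numpy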
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
path = 'pedramyazdipoor/persian_xlm_roberta_large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)
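For a quick test, the model should also work with the transformers question-answering pipeline. This is a minimal sketch not present in the original; it uses the same context and question as the advanced example below:
from transformers import pipeline

# The pipeline handles tokenization, span selection, and answer decoding internally.
qa = pipeline('question-answering', model=path, tokenizer=path)
print(qa(question='چند سالمه؟', context='سلام من پدرامم 26 سالمه'))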
Advanced Usage
# Considerations for inference:
# 1) The start index of the answer must not be greater than the end index.
# 2) The answer span must lie within the context.
# 3) The selected span must be the most probable choice among the top-N start/end candidate pairs.
import numpy as np
import torch

def generate_indexes(start_logits, end_logits, N, min_index):
    # Pair each token index with its logit, for start and end positions separately.
    start_indexes = np.arange(len(start_logits))
    start_probs = start_logits
    list_start = dict(zip(start_indexes, start_probs.tolist()))
    end_indexes = np.arange(len(end_logits))
    end_probs = end_logits
    list_end = dict(zip(end_indexes, end_probs.tolist()))
    # Sort candidates by logit, descending.
    sorted_start_list = sorted(list_start.items(), key=lambda x: x[1], reverse=True)
    sorted_end_list = sorted(list_end.items(), key=lambda x: x[1], reverse=True)
    final_start_idx, final_end_idx = [[] for l in range(2)]
    # Start from the span (0, 0) and keep the most probable valid (start, end) pair
    # among the top-N start and top-N end candidates.
    start_idx, end_idx, prob = 0, 0, (start_probs.tolist()[0] + end_probs.tolist()[0])
    for a in range(0, N):
        for b in range(0, N):
            if (sorted_start_list[a][1] + sorted_end_list[b][1]) > prob:
                if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_start_list[a][0] > min_index):
                    prob = sorted_start_list[a][1] + sorted_end_list[b][1]
                    start_idx = sorted_start_list[a][0]
                    end_idx = sorted_end_list[b][0]
    final_start_idx.append(start_idx)
    final_end_idx.append(end_idx)
    return final_start_idx[0], final_end_idx[0]
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)
text = 'سلام من پدرامم 26 سالمه'  # "Hi, I am Pedram, I am 26 years old."
question = 'چند سالمه؟'  # "How old am I?"
encoding = tokenizer(question, text, add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)
out = model(encoding['input_ids'].to(device), encoding['attention_mask'].to(device), encoding['token_type_ids'].to(device))
# The code above is adapted to generate one answer at a time.
# If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:]:
# <s> (the first token) stands for "no answer" and must be compared against the other tokens.
# You can set min_index in generate_indexes() to force the chosen start token to lie within the context
# (the start index must be greater than the index of the separator token).
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, 0)
print(tokenizer.tokenize(text + question))
print(tokenizer.tokenize(text + question)[answer_start_index : (answer_end_index + 1)])
>>> ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'چند', '▁سالم', 'ه', '؟']
>>> ['▁26']
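Building on the note above about unanswerable questions, here is a minimal sketch (my addition, not part of the original) that keeps the <s> logit and treats a predicted start index of 0 as "no answer"; passing min_index = -1 allows index 0 to be selected:
# Hedged sketch: keep the <s> logit (index 0) so the model can signal "no answer".
start_idx, end_idx = generate_indexes(out['start_logits'][0], out['end_logits'][0], 5, -1)
if start_idx == 0:
    print('unanswerable')
else:
    # With unsliced logits the indices line up with the encoded sequence,
    # so the predicted span can be decoded directly from input_ids.
    print(tokenizer.decode(encoding['input_ids'][0][start_idx : end_idx + 1]))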
🔧 Technical Details
Training hyperparameters
Due to the GPU memory limitations in Google Colab, the batch size was set to 4.
batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
evaluation_strategy = "epoch"
save_strategy = "epoch"
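For orientation, these settings map roughly onto a Hugging Face TrainingArguments configuration. The following is a hedged sketch only; the original training script is not provided, output_dir is a hypothetical name, and max_seq_len is applied at tokenization time rather than here:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='persian_xlm_roberta_large',  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
)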
Performance
The model is evaluated on the official PQuAD test set. Training for more than one epoch led to worse performance. Our XLM-RoBERTa outperforms our ParsBERT on PQuAD, but it is over three times larger, so a direct comparison between the two is not fair.
Property | Details |
---|---|
Model Type | Persian XLM-RoBERTa Large |
Training Data | PQuAD train set |

Question Answering on the Test Set of the PQuAD Dataset

Metric | Our XLM-RoBERTa Large | Our ParsBERT |
---|---|---|
Exact Match | 66.56* | 47.44 |
F1 | 87.31* | 81.96 |
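Exact Match and F1 are the standard SQuAD-style metrics. As a minimal sketch (my addition, using the Hugging Face evaluate library rather than the official PQuAD evaluation script), they can be computed like this:
import evaluate

# Sketch only: the prediction and reference below are illustrative placeholders.
squad_metric = evaluate.load('squad')
predictions = [{'id': '1', 'prediction_text': '26'}]
references = [{'id': '1', 'answers': {'text': ['26'], 'answer_start': [15]}}]
print(squad_metric.compute(predictions=predictions, references=references))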
📄 License
No license is specified for this model.
Acknowledgments
We would like to express our gratitude to Newsha Shahbodaghkhan for facilitating the dataset gathering.
Contributors
- Pedram Yazdipoor : LinkedIn
Releases
Release v0.2 (Sep 18, 2022)
This is the second version of our Persian XLM-RoBERTa Large; the previous version had some issues.