# 🚀 RoBERTa-base for Question Answering

This project is built on the RoBERTa-base language model and focuses on extractive question answering, using the SQuAD 2.0 dataset for training and evaluation.
## 🚀 Quick Start

This project uses the `roberta-base` language model for extractive question answering; both training and evaluation use SQuAD 2.0.
## ✨ Key Features

- Language model: `roberta-base`.
- Downstream task: extractive question answering.
- Training and evaluation data: the SQuAD 2.0 dataset.
## 📦 Installation

The original documentation does not provide installation steps.
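As an assumption not stated in the source, the usage example below only requires the `transformers` library plus a PyTorch backend, e.g. `pip install transformers torch`.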
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "PremalMatalia/roberta-base-best-squad2"

# a) Get predictions through the question-answering pipeline
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Which name is also used to describe the Amazon rainforest in English?',
    'context': 'The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet\'s remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.'
}
res = nlp(QA_input)
print(res)

# b) Load the model and tokenizer directly
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
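The pipeline returns a dict with `score`, `start`, `end`, and `answer` keys; for this input the extracted span should be along the lines of `Amazonia`.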
## 📚 Detailed Documentation

### Environment Information
| Property | Details |
|----------|---------|
| transformers version | 4.9.1 |
| Platform | Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic |
| Python version | 3.7.11 |
| PyTorch version (GPU?) | 1.9.0+cu102 (no) |
| Tensorflow version (GPU?) | 2.5.0 (no) |
### Hyperparameters
```
max_seq_len=386
doc_stride=128
n_best_size=20
max_answer_length=30
min_null_score=7.0
batch_size=8
n_epochs=6
base_LM_model="roberta-base"
learning_rate=1.5e-5
adam_epsilon=1e-5
adam_beta1=0.95
adam_beta2=0.999
warmup_steps=100
weight_decay=0.01
optimizer=AdamW
lr_scheduler="polynomial"
```
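For illustration only, the trainer-level hyperparameters above map onto the standard `transformers` `TrainingArguments` API roughly as follows. This is a sketch under that assumption, not the author's actual training script (which is not published), and `output_dir` is hypothetical; the tokenization and post-processing values (`max_seq_len`, `doc_stride`, `n_best_size`, `max_answer_length`, `min_null_score`) are applied outside `TrainingArguments`.

```python
from transformers import TrainingArguments

# Sketch: mapping the listed hyperparameters onto TrainingArguments.
# The author's actual training script is not published; output_dir is made up.
training_args = TrainingArguments(
    output_dir="./roberta-base-squad2",   # hypothetical output path
    per_device_train_batch_size=8,        # batch_size=8
    num_train_epochs=6,                   # n_epochs=6
    learning_rate=1.5e-5,
    adam_epsilon=1e-5,
    adam_beta1=0.95,
    adam_beta2=0.999,
    warmup_steps=100,
    weight_decay=0.01,
    lr_scheduler_type="polynomial",       # AdamW is the default optimizer
)
```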
### ⚠️ Important Note

A special threshold `CLS_threshold=-3` is used to identify "no answer" cases more accurately; the exact logic will be made available in a GitHub repository (TBD).
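Since that logic is still pending, the sketch below is only a hypothetical illustration of the common SQuAD 2.0 convention such thresholds follow, comparing the null ("no answer") score against the best non-null span score; it is not the author's code.

```python
# Hypothetical illustration of a CLS/null-score threshold in SQuAD 2.0
# post-processing; the author's actual logic is not yet published.
def choose_answer(best_span_text: str, best_span_score: float,
                  null_score: float, cls_threshold: float = -3.0) -> str:
    # Common SQuAD 2.0 convention: if the null score beats the best span
    # score by more than the threshold, predict "no answer" (empty string).
    score_diff = null_score - best_span_score
    if score_diff > cls_threshold:
        return ""
    return best_span_text
```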
### Performance Metrics
"exact": 81.192622
"f1": 83.95408
"total": 11873
"HasAns_exact": 74.190283
"HasAns_f1": 79.721119
"HasAns_total": 5928
"NoAns_exact": 88.174937
"NoAns_f1": 88.174937
"NoAns_total": 5945
## 🔧 Technical Details

No further technical details are provided in the original documentation.
## 📄 License

No license information is provided in the original documentation.
## 👥 Author

Premal Matalia