🚀 T5-large-spell Model
The model converts all the words in a text into standard English, correcting spelling errors and typos along the way. The proofreader was trained on top of the T5-large model. The training corpus is a large dataset with "artificial" errors: the corpus was built from English Wikipedia and news blogs, after which spelling errors and typos were introduced automatically using the SAGE library.
✨ Key Features
- Effectively corrects spelling errors and typos in English text.
- Built on top of the powerful T5-large model.
- Trained on a large dataset with "artificial" errors, which improves its correction ability.
📦 Installation
The documentation does not describe specific installation steps, so none are given here.
💻 Usage Examples
Basic Usage
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

path_to_model = "ai-forever/T5-large-spell"

model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

# The model expects the "grammar: " task prefix in front of the input text.
prefix = "grammar: "
sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence

encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)
# ["If you bought something gorgeous, you will be very happy."]
```
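The model corrects one input at a time, so longer texts are best split into sentences and corrected in a batch. Below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; the sentence list and the `max_new_tokens` value are illustrative assumptions, not part of the original card:

```python
# Hedged sketch: batch correction of several sentences at once.
# Assumes `model` and `tokenizer` are already loaded as in the basic example.
sentences = [
    "grammar: " + s
    for s in [
        "If you bought something goregous, you well be very happy.",
        "Th festeivаl was excelzecnt in many ways.",
    ]
]

# Pad to a common length so the batch can be encoded as one tensor.
encodings = tokenizer(sentences, return_tensors="pt", padding=True)
generated_tokens = model.generate(**encodings, max_new_tokens=64)
for corrected in tokenizer.batch_decode(generated_tokens, skip_special_tokens=True):
    print(corrected)
```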
📚 Documentation
Public References
Examples

| Input | Output |
| --- | --- |
| Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see. |
| That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date. |
| If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy. |
Metrics
Quality
Below are the automatic metrics used to assess the correctness of spell checkers. We compare our solution with open-source automatic spell checkers and with models of the ChatGPT family on two available datasets:
- BEA60K: English spelling errors collected from several domains;
- JFLEG: 1601 English sentences containing about 2000 spelling errors.
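In the tables below, F1 is the harmonic mean of precision and recall. A minimal sanity-check sketch (the helper function is ours, not part of the model card):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in percentage points."""
    return 2 * precision * recall / (precision + recall)

# Reproduces the T5-large-spell row of the BEA60K table below.
print(round(f1_score(66.5, 83.1), 1))  # 73.9
```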
BEA60K
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
| ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
| ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
| Bert (https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |
JFLEG
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
| ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
| ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
| Bert (https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
Resources
🔧 Technical Details
The model was trained on top of T5-large using a large dataset with "artificial" errors. The dataset was built from English Wikipedia and news blogs, and spelling errors and typos were then introduced automatically with the SAGE library, which improves the model's error-correction ability.
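For intuition, the sketch below mimics this kind of "artificial" corruption with simple character-level edits. It is a toy illustration only; the SAGE library's actual corruption methods and API are more sophisticated and are not reproduced here:

```python
import random

# Toy illustration only: NOT the SAGE library's API.
# Applies one random character-level edit (swap, drop, duplicate, replace)
# to some words, mimicking "artificial" typos injected into a clean corpus.
def corrupt_word(word: str, rng: random.Random) -> str:
    if len(word) < 3:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "drop", "dup", "replace"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "dup":
        return word[:i + 1] + word[i] + word[i + 1:]
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

def corrupt_sentence(sentence: str, rate: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(
        corrupt_word(w, rng) if rng.random() < rate else w
        for w in sentence.split()
    )

print(corrupt_sentence("If you bought something gorgeous, you will be very happy."))
```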
📄 License
The T5-large model underlying our solution, together with its source code, is distributed under the Apache-2.0 license. Our solution itself is distributed under the MIT license.
📋 Specifications

| Property | Details |
| --- | --- |
| File size | 3 GB |
| Framework | PyTorch |
| Format | AI service |
| Version | v1.0 |
| Developers | SberDevices, AGI NLP |
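The ~3 GB file size is consistent with T5-large's parameter count (on the order of 750 million) stored as 32-bit floats. A quick hedged check, assuming the `model` object loaded in the usage example above:

```python
# Rough size estimate: parameter count × 4 bytes per fp32 weight.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters ≈ {n_params * 4 / 1e9:.1f} GB in fp32")
```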
📞 Contact
nikita.martynov.98@list.ru