🚀 T5-large-spell Model
The model converts all the words in a text into standard English, correcting spelling errors and typos along the way. The proofreader was trained on top of the T5-large model. The training corpus is a large dataset with "artificial" errors: the corpus was built from English Wikipedia and news blogs, after which spelling errors and typos were introduced automatically using the SAGE library.
✨ Key Features
- Effectively corrects spelling errors and typos in English text.
- Built on top of the powerful T5-large model.
- Trained on a large dataset with "artificial" errors, which improves its correction ability.
📦 Installation
The documentation does not describe specific installation steps, so none are given here.
💻 Usage Examples
Basic Usage
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

path_to_model = "ai-forever/T5-large-spell"

model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

# The model expects the "grammar: " task prefix in front of the input text.
prefix = "grammar: "
sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence

encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)
# ["If you bought something gorgeous, you will be very happy."]
```
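The model corrects one input at a time, so longer texts are best split into sentences and corrected in a batch. Below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; the sentence list and the `max_new_tokens` value are illustrative assumptions, not part of the original card:

```python
# Hedged sketch: batch correction of several sentences at once.
# Assumes `model` and `tokenizer` are already loaded as in the basic example.
sentences = [
    "grammar: " + s
    for s in [
        "If you bought something goregous, you well be very happy.",
        "Th festeivаl was excelzecnt in many ways.",
    ]
]

# Pad to a common length so the batch can be encoded as one tensor.
encodings = tokenizer(sentences, return_tensors="pt", padding=True)
generated_tokens = model.generate(**encodings, max_new_tokens=64)
for corrected in tokenizer.batch_decode(generated_tokens, skip_special_tokens=True):
    print(corrected)
```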
📚 Documentation
Public References
Examples

| Input | Output |
| --- | --- |
| Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see. |
| That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date. |
| If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy. |
Metrics
Quality
Below are the automatic metrics used to assess the correctness of spell checkers. We compare our solution with open-source automatic spell checkers and with models of the ChatGPT family on two available datasets:
- BEA60K: English spelling errors collected from several domains;
- JFLEG: 1601 English sentences containing about 2000 spelling errors.
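In the tables below, F1 is the harmonic mean of precision and recall. A minimal sanity-check sketch (the helper function is ours, not part of the model card):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in percentage points."""
    return 2 * precision * recall / (precision + recall)

# Reproduces the T5-large-spell row of the BEA60K table below.
print(round(f1_score(66.5, 83.1), 1))  # 73.9
```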
BEA60K
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
| ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
| ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
| Bert (https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |
JFLEG
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
| ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
| ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
| Bert (https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
Resources
🔧 Technical Details
The model was trained on top of T5-large using a large dataset with "artificial" errors. The dataset was built from English Wikipedia and news blogs, and spelling errors and typos were then introduced automatically with the SAGE library, which improves the model's error-correction ability.
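For intuition, the sketch below mimics this kind of "artificial" corruption with simple character-level edits. It is a toy illustration only; the SAGE library's actual corruption methods and API are more sophisticated and are not reproduced here:

```python
import random

# Toy illustration only: NOT the SAGE library's API.
# Applies one random character-level edit (swap, drop, duplicate, replace)
# to some words, mimicking "artificial" typos injected into a clean corpus.
def corrupt_word(word: str, rng: random.Random) -> str:
    if len(word) < 3:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "drop", "dup", "replace"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "dup":
        return word[:i + 1] + word[i] + word[i + 1:]
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

def corrupt_sentence(sentence: str, rate: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(
        corrupt_word(w, rng) if rng.random() < rate else w
        for w in sentence.split()
    )

print(corrupt_sentence("If you bought something gorgeous, you will be very happy."))
```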
📄 License
The T5-large model underlying our solution, together with its source code, is distributed under the Apache-2.0 license. Our solution itself is distributed under the MIT license.
📋 Specifications

| Property | Details |
| --- | --- |
| File size | 3 GB |
| Framework | PyTorch |
| Format | AI service |
| Version | v1.0 |
| Developers | SberDevices, AGI NLP |
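The ~3 GB file size is consistent with T5-large's parameter count (on the order of 750 million) stored as 32-bit floats. A quick hedged check, assuming the `model` object loaded in the usage example above:

```python
# Rough size estimate: parameter count × 4 bytes per fp32 weight.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters ≈ {n_params * 4 / 1e9:.1f} GB in fp32")
```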
📞 Contact
nikita.martynov.98@list.ru