🚀 T5-large-spell Model
The model corrects spelling errors and typos by bringing all words in a text into line with standard English. The proofreader was trained on the basis of the T5-large model. An extensive dataset with "artificial" errors served as the training corpus: the corpus was assembled from English-language Wikipedia and news blogs, after which spelling errors and typos were introduced automatically using the SAGE library.
✨ Key Features
- Effectively corrects spelling errors and typos in English text.
- Built on the powerful T5-large model.
- Trained on a large dataset with "artificial" errors, which strengthens its correction ability.
📦 Installation
The original documentation does not give explicit installation steps.
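As a minimal sketch, the usage example below only requires the Hugging Face transformers library and PyTorch; the package names here are the standard ones implied by that example, not taken from the card:

```bash
pip install transformers torch
```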
💻 Usage Examples
Basic Usage
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the spell checker and its tokenizer from the Hugging Face Hub
path_to_model = "ai-forever/T5-large-spell"
model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

# The model expects the "grammar: " prefix in front of the input sentence
prefix = "grammar: "
sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence

# Tokenize, generate the correction, and decode it back to text
encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)
# ["If you bought something gorgeous, you will be very happy."]
```
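The card only shows single-sentence usage. Below is a hedged sketch of batch correction using standard transformers options; `padding=True` and `max_new_tokens` are assumptions of this sketch, not settings taken from the card:

```python
# Sketch only: correct several prefixed sentences in one batch
sentences = [
    "grammar: Th festeivаl was excelzecnt in many ways.",
    "grammar: If you bought something goregous, you well be very happy.",
]
encodings = tokenizer(sentences, return_tensors="pt", padding=True)
generated_tokens = model.generate(**encodings, max_new_tokens=128)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```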
📚 Documentation
Public References
Examples
| Input | Output |
| --- | --- |
| Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see. |
| That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date. |
| If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy. |
Metrics
Quality
Below are the automatic metrics used to assess the correctness of the spell checkers. We compared our solution with open-source automatic spell checkers and models of the ChatGPT family on two available datasets (an illustrative scoring sketch follows the list):
- BEA60K: English spelling errors collected from several domains;
- JFLEG: 1,601 English sentences containing roughly 2,000 spelling errors.
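The card reports precision, recall, and F1 but does not describe the evaluation protocol. Purely as an illustration (the word-level alignment and the function below are assumptions of this sketch, not the evaluation actually used), such scores can be computed by comparing each hypothesis word against the source and a gold reference:

```python
# Illustrative word-level scoring; assumes source, hypothesis, and reference
# have the same number of tokens, which real evaluations handle via alignment.
def word_level_prf1(source: str, hypothesis: str, reference: str):
    triples = list(zip(source.split(), hypothesis.split(), reference.split()))
    tp = sum(h == r and r != s for s, h, r in triples)  # correct fixes
    fp = sum(h != s and h != r for s, h, r in triples)  # wrong edits
    fn = sum(r != s and h != r for s, h, r in triples)  # missed fixes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(word_level_prf1(
    "you well be happy",
    "you will be happy",
    "you will be happy",
))  # (1.0, 1.0, 1.0)
```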
BEA60K

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
| ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
| ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
| Bert (https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |
JFLEG

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
| ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
| ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
| Bert (https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
Resources
🔧 Technical Details
The model was trained on the basis of T5-large using a large dataset with "artificial" errors. The dataset was built from English-language Wikipedia and news blogs; spelling errors and typos were then introduced automatically with the SAGE library, which strengthens the model's correction ability.
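The card does not reproduce SAGE's corruption API. Purely as a hypothetical illustration of what "artificial" typo injection looks like (the function, operations, and probability below are invented for this sketch and are not SAGE's), a character-level corruptor might read:

```python
import random

def corrupt_word(word: str, rng: random.Random, p: float = 0.15) -> str:
    """Randomly apply one of three character-level typos with probability p."""
    if len(word) < 3 or rng.random() > p:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "drop", "double"])
    if op == "swap":                        # transpose adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":                        # delete a character
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]    # duplicate a character

rng = random.Random(42)
clean = "The festival was excellent in many ways"
noisy = " ".join(corrupt_word(w, rng) for w in clean.split())
print(noisy)  # the clean sentence with a few injected typos
```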
📄 License
The T5-large model on which our solution is based, together with its source code, is distributed under the Apache-2.0 license. Our solution itself is distributed under the MIT license.
📋 Specifications

| Property | Details |
| --- | --- |
| File size | 3 GB |
| Framework | PyTorch |
| Format | AI service |
| Version | v1.0 |
| Developers | SberDevices, AGI NLP |
📞 Contact
nikita.martynov.98@list.ru