🚀 T5-large-spell model
This model corrects spelling errors and typos, bringing all words in the text into line with standard English. It is built on top of the T5-large model. The training corpus is an extensive dataset with "artificial" errors: texts were assembled from English Wikipedia and news blogs, and typos and spelling mistakes were then introduced automatically using the SAGE library.
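SAGE's own corruption pipeline is statistics-based and is not reproduced here; the sketch below only illustrates the general idea of injecting character-level typos into clean text. The function and probabilities are ours for illustration, not SAGE's API:

import random

def add_typos(text: str, p: float = 0.05, seed: int = 0) -> str:
    # Toy character-level corruption: randomly drop, duplicate, or
    # transpose letters. SAGE models realistic human error statistics.
    rng = random.Random(seed)
    chars = list(text)
    out, i = [], 0
    while i < len(chars):
        c = chars[i]
        op = rng.choice(["drop", "double", "swap"]) if c.isalpha() and rng.random() < p else None
        if op == "drop":
            pass                                # delete the character
        elif op == "double":
            out += [c, c]                       # duplicate it
        elif op == "swap" and i + 1 < len(chars):
            out += [chars[i + 1], c]            # transpose with the next one
            i += 1
        else:
            out.append(c)                       # keep unchanged
        i += 1
    return "".join(out)

print(add_typos("If you bought something gorgeous, you will be very happy."))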
🚀 Quick Start
from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the spell checker and its tokenizer from the Hugging Face Hub
path_to_model = "ai-forever/T5-large-spell"
model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

# Every input must be prepended with the "grammar: " prefix
prefix = "grammar: "
sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence

# Tokenize, generate the corrected text, and decode it back to a string
encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)
# ["If you bought something gorgeous, you will be very happy."]
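The tokenizer and model also accept batches of sentences. Continuing the snippet above, a minimal sketch (the padding and max_new_tokens settings are our additions, not part of the original example; generate otherwise falls back to the library's default output length, which may truncate longer texts):

sentences = [
    prefix + "Th festeivаl was excelzecnt in many ways.",
    prefix + "If you bought something goregous, you well be very happy.",
]
encodings = tokenizer(sentences, return_tensors="pt", padding=True)
generated_tokens = model.generate(**encodings, max_new_tokens=128)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))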
✨ Features
- Corrects spelling errors and typos in English text.
- Trained on a large dataset with artificial errors.
📚 Documentation
Summary
The model corrects spelling errors and typos by bringing all words in the text into line with standard English. The proofreader was trained on top of the T5-large model. An extensive dataset with "artificial" errors served as the training corpus: it was assembled from English-language Wikipedia and news blogs, and typos and spelling errors were then introduced into it automatically using the SAGE library.
Examples
Input | Output
Th festeivаl was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chаllenging, bet brilli an t ea. | The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see.
That 's why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n't be any problem with being up - do - date . | That's why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won't be any problem with being up - do - date.
If you bought something goregous, you well be very happy. | If you bought something gorgeous, you will be very happy.
🔧 Technical Details
Metrics
Quality
Below are automatic metrics that assess the correctness of the spell checkers. We compare our solution with both open automatic spell checkers and models of the ChatGPT family on two available datasets:
- BEA60K: English spelling errors collected from several domains;
- JFLEG: 1,601 English sentences containing around 2,000 spelling errors.
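The tables below report word-level precision, recall, and F1. As a simplified illustration of how such metrics can be computed (the position-by-position token comparison is our assumption; the actual benchmarks align source and reference more carefully):

def spellcheck_prf(source: str, hypothesis: str, reference: str):
    # Count corrections word by word (assumes the three strings have the
    # same number of tokens; real evaluation uses proper alignment).
    tp = fp = fn = 0
    for src, hyp, ref in zip(source.split(), hypothesis.split(), reference.split()):
        if hyp != src and hyp == ref:
            tp += 1  # needed correction, made correctly
        elif hyp != src:
            fp += 1  # wrong or unnecessary change
        elif src != ref:
            fn += 1  # needed correction that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1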
BEA60K
Model | Precision | Recall | F1
T5-large-spell | 66.5 | 83.1 | 73.9
ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5
ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0
ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0
Bert (https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0
SC-LSTM (https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0
JFLEG
Model | Precision | Recall | F1
T5-large-spell | 83.4 | 84.3 | 83.8
ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9
ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8
ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2
Bert (https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8
SC-LSTM (https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2
📄 License
The T5-large model, on which our solution is based, and its source code are distributed under the Apache 2.0 license. Our solution is distributed under the MIT license.
📋 Specifications
Property | Details
File size | 3 GB
Framework | PyTorch
Format | AI Service
Version | v1.0
Developer | SberDevices, AGI NLP
📞 Contacts
nikita.martynov.98@list.ru