Sage-m2m100-1.2B Open-source Russian Spell Checker - Free to Correct Spelling and Typing Errors

Sage M2m100 1.2B

Developed by ai-forever

A Russian spell checker trained on the M2M100-1.2B model for correcting spelling and typing errors

Category: Other
Open Source License: MIT
Tags: #Russian spelling correction #Multi-domain text standardization #High-precision grammar correction

Downloads 184

Release Time : 3/11/2024

Model Overview

This model corrects spelling and typing errors by normalizing every word in the text to the Russian language norm. The training corpus is a large dataset with 'artificial' errors, built from Russian Wikipedia and transcripts of Russian videos.

Model Features

Multi-domain applicability

Performs well on various Russian datasets across different domains, including social media, medical, and technical texts

High-precision correction

Achieves 88.8% precision and 71.5% recall on the RUSpellRU dataset

Large model-based

Fine-tuned from the 1.2B-parameter M2M100 model, with strong language understanding capabilities

Model Capabilities

Russian spell checking

Typo correction

Text normalization

Use Cases

Text processing

Social media text correction

Corrects non-standard spellings and typos in social media content

Achieves an F1 score of 79.2 on the RUSpellRU dataset

Medical text standardization

Corrects spelling errors in professional medical terminology

Achieves an F1 score of 74.9 on the MedSpellchecker dataset

Technical document processing

Commit message correction

Corrects spelling errors in GitHub commit messages

Achieves an F1 score of 44.9 on the GitHubTypoCorpusRu dataset

🚀 sage-m2m100-1.2B Model

This model corrects spelling errors and typos in Russian text, bringing words to the language norm. It is based on the M2M100-1.2B model and fine-tuned for this specific task.

🚀 Quick Start

The sage-m2m100-1.2B model is designed to correct spelling errors and typos in Russian text. It was trained on an extensive dataset with artificially introduced errors, created from Russian-language Wikipedia and video transcripts.

✨ Features

Spelling Correction: Corrects spelling errors and typos in Russian text.
Fine-Tuned Model: Based on the pre-trained M2M100-1.2B model and fine-tuned for Russian spelling correction.
Extensive Training Data: Trained on a large corpus with synthetic errors generated using the SAGE library.

📚 Documentation

Summary

The model corrects spelling errors and typos by normalizing every word in the text to the Russian language norm. The corrector was trained on top of the M2M100-1.2B model. The training corpus is an extensive dataset with “artificial” errors: it was assembled from Russian-language Wikipedia and transcripts of Russian-language videos, and typos and spelling errors were then introduced automatically using the SAGE library. The model is a fine-tuned version of the pre-trained model.
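To make the idea of "artificial" errors concrete, here is a minimal sketch of character-level typo injection. It is not the SAGE library's API or its actual error-generation method; the helper `inject_typos`, its parameters, and the four edit operations are assumptions chosen purely for illustration.

```python
import random

# Lowercase Russian alphabet used for random insertions and substitutions.
RUSSIAN_LETTERS = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def inject_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical helper: corrupt each Russian letter with probability `rate`."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.lower() in RUSSIAN_LETTERS and rng.random() < rate:
            op = rng.choice(["delete", "insert", "substitute", "swap"])
            if op == "delete":
                pass  # drop the character entirely
            elif op == "insert":
                out += [ch, rng.choice(RUSSIAN_LETTERS)]  # add a stray letter
            elif op == "substitute":
                out.append(rng.choice(RUSSIAN_LETTERS))  # replace the letter
            elif i + 1 < len(chars):
                out += [chars[i + 1], ch]  # swap with the next character
                i += 1
            else:
                out.append(ch)  # nothing to swap with at the end of the text
        else:
            out.append(ch)
        i += 1
    return "".join(out)


clean = "придя в МГТУ я был удивлен никого не обнаружив там"
noisy = inject_typos(clean, rate=0.15, seed=42)
print(noisy)
```

Pairs of `(noisy, clean)` strings produced this way could serve as input and target for a sequence-to-sequence corrector; a real pipeline would presumably model error statistics far more carefully than this uniform random scheme.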

Public references

SAGE library announcement, DataFest 2023
Paper about synthetic error generation methods, Dialogue 2023
SAGE EACL 2024 paper

Model Index

Property	Details
Model Name	sage-m2m100-1.2B
Task Type	text-generation
Datasets	spellcheck_benchmark (RUSpellRU, MultidomainGold, MedSpellchecker, GitHubTypoCorpusRu)
Metrics	Precision, Recall, F1

Results

Dataset	Precision	Recall	F1
RUSpellRU	88.8	71.5	79.2
MultidomainGold	63.8	61.1	62.4
MedSpellchecker	78.8	71.4	74.9
GitHubTypoCorpusRu	47.1	42.9	44.9
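The F1 column is consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall; the function name `f1_score` is just an illustrative helper, and the source does not state how its metrics were computed beyond naming them.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table above.
results = {
    "RUSpellRU": (88.8, 71.5, 79.2),
    "MultidomainGold": (63.8, 61.1, 62.4),
    "MedSpellchecker": (78.8, 71.4, 74.9),
    "GitHubTypoCorpusRu": (47.1, 42.9, 44.9),
}

for name, (p, r, reported) in results.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.1f}, reported F1 = {reported}")
```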

💻 Usage Examples

Basic Usage

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the fine-tuned checkpoint and its tokenizer; source and target language are both Russian
path_to_model = "ai-forever/sage-m2m100-1.2B"
model = M2M100ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = M2M100Tokenizer.from_pretrained(path_to_model, src_lang="ru", tgt_lang="ru")

# Sentence containing spelling errors
sentence = "прийдя в МГТУ я был удивлен никого необноружив там…"
encodings = tokenizer(sentence, return_tensors="pt")

# Force Russian as the target language of generation
generated_tokens = model.generate(
    **encodings, forced_bos_token_id=tokenizer.get_lang_id("ru")
)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

print(answer)
# ["прийдя в МГТУ я был удивлен никого не обнаружив там..."]

Examples Table

Input	Output
Думю ешцъа лет череа 10 ретроспективно просматривотьэ то будкетцц мне невероя тна ин те р но	Думаю что лет через 10 ретроспективно просматривать это будет мне невероятно интересно
Основая цель мероприятия - практическая отработка навыков по оказанию помощи гражданам, попавшим в ДТП, а также повышение и совершенствование уровня профессиональной подготовки сотрудников МЧС при проведении аварийно-спасательных работ по ликвидации последствий дорожно-транспортных проишествий, сокращение временных показателей реагирования.	Основная цель мероприятия - практическая отработка навыков по оказанию помощи гражданам, попавшим в ДТП, а также повышение и совершенствование уровня профессиональной подготовки сотрудников МЧС при проведении аварийно-спасательных работ по ликвидации последствий дорожно-транспортных происшествий, сокращение временных показателей реагирования.
прийдя в МГТУ я был удивлен никого необноружив там…	придя в МГТУ я был удивлен никого не обнаружив там
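When reviewing corrections like those in the table, it can help to list only the spans that changed. The sketch below does this with Python's standard `difflib` on whitespace-split tokens; `changed_spans` is a hypothetical helper for illustration and is not part of the model or the SAGE library.

```python
import difflib


def changed_spans(source: str, corrected: str) -> list[tuple[str, str]]:
    """Return (original, corrected) pairs for every span that differs."""
    src_tokens = source.split()
    tgt_tokens = corrected.split()
    matcher = difflib.SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replacements, insertions, and deletions
            changes.append((" ".join(src_tokens[i1:i2]), " ".join(tgt_tokens[j1:j2])))
    return changes


source = "прийдя в МГТУ я был удивлен никого необноружив там…"
corrected = "придя в МГТУ я был удивлен никого не обнаружив там"

for before, after in changed_spans(source, corrected):
    print(f"{before!r} -> {after!r}")
```

Comparing tokens rather than characters keeps each reported change aligned with whole words, which is usually easier to read for spelling corrections.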

📊 Metrics

Quality

The automatic metrics below measure the correctness of each spell checker. We compare our solution with both open automatic spell checkers and the ChatGPT family of models on all four available datasets:

RUSpellRU: Texts collected from LiveJournal, with manually corrected typos and errors.
MultidomainGold: Examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents, and literary works.
MedSpellChecker: Texts with errors from medical histories (anamneses).
GitHubTypoCorpusRu: Spelling errors and typos in commits from GitHub.

RUSpellRU

Model	Precision	Recall	F1
sage-m2m100-1.2B	88.8	71.5	79.2
sage-ai-service	93.5	82.4	87.6
gpt-3.5-turbo	39.6	62.3	48.5
gpt-4	69.5	81.0	74.8
Yandex.Speller	83.0	59.8	69.5
JamSpell	42.1	32.8	36.9
HunSpell	31.3	34.9	33.0

MultidomainGold

Model	Precision	Recall	F1
sage-m2m100-1.2B	63.8	61.1	62.4
sage-ai-service	70.9	68.8	69.9
gpt-3.5-turbo	17.8	56.1	27.0
gpt-4	31.1	78.1	44.5
Yandex.Speller	52.9	51.4	52.2
JamSpell	25.7	30.6	28.0
HunSpell	16.2	40.1	23.0

MedSpellChecker

Model	Precision	Recall	F1
sage-m2m100-1.2B	78.8	71.4	74.9
sage-ai-service	73.4	76.2	74.9
gpt-3.5-turbo	15.1	53.6	23.5
gpt-4	48.9	88.7	63.1
Yandex.Speller	80.6	47.8	60.0
JamSpell	24.6	29.7	26.9
HunSpell	10.3	40.2	16.4

GitHubTypoCorpusRu

Model	Precision	Recall	F1
sage-m2m100-1.2B	47.1	42.9	44.9
sage-ai-service	76.1	51.2	61.2
gpt-3.5-turbo	23.7	43.9	30.8
gpt-4	34.7	60.5	44.1
Yandex.Speller	67.7	37.5	48.3
JamSpell	49.5	29.9	37.3
HunSpell	28.5	30.7	29.6
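To read the four tables at a glance, the sketch below lists, for each dataset, the systems whose reported F1 is strictly higher than that of sage-m2m100-1.2B. The F1 values are copied from the tables above; the dictionary layout and the helper `better_than` are assumptions made for this illustration.

```python
# Reported F1 per dataset and model, copied from the tables above.
F1_SCORES = {
    "RUSpellRU": {
        "sage-m2m100-1.2B": 79.2, "sage-ai-service": 87.6, "gpt-3.5-turbo": 48.5,
        "gpt-4": 74.8, "Yandex.Speller": 69.5, "JamSpell": 36.9, "HunSpell": 33.0,
    },
    "MultidomainGold": {
        "sage-m2m100-1.2B": 62.4, "sage-ai-service": 69.9, "gpt-3.5-turbo": 27.0,
        "gpt-4": 44.5, "Yandex.Speller": 52.2, "JamSpell": 28.0, "HunSpell": 23.0,
    },
    "MedSpellChecker": {
        "sage-m2m100-1.2B": 74.9, "sage-ai-service": 74.9, "gpt-3.5-turbo": 23.5,
        "gpt-4": 63.1, "Yandex.Speller": 60.0, "JamSpell": 26.9, "HunSpell": 16.4,
    },
    "GitHubTypoCorpusRu": {
        "sage-m2m100-1.2B": 44.9, "sage-ai-service": 61.2, "gpt-3.5-turbo": 30.8,
        "gpt-4": 44.1, "Yandex.Speller": 48.3, "JamSpell": 37.3, "HunSpell": 29.6,
    },
}


def better_than(baseline: str, scores: dict[str, float]) -> list[str]:
    """Models with strictly higher F1 than `baseline`, best first."""
    threshold = scores[baseline]
    winners = [name for name, f1 in scores.items() if f1 > threshold]
    return sorted(winners, key=lambda name: scores[name], reverse=True)


for dataset, scores in F1_SCORES.items():
    print(dataset, better_than("sage-m2m100-1.2B", scores))
```

By this measure only sage-ai-service scores higher on RUSpellRU and MultidomainGold, no system scores strictly higher on MedSpellChecker (sage-ai-service ties at 74.9), and both sage-ai-service and Yandex.Speller score higher on GitHubTypoCorpusRu.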

📦 Resources

SAGE library, GitHub
sage-fredt5-large, HuggingFace
sage-fredt5-distilled-95m, HuggingFace
sage-m2m100-1.2B, HuggingFace
sage-mt5-large, HuggingFace

📄 License

The M2M100-1.2B model on which our solution is based, together with its source code, is distributed under the MIT open-source license. Our solution is also released under the MIT license.

🔧 Technical Details

Specifications

Property	Details
File size	5 GB
Framework	pytorch
Format	AI Service
Version	v2.0
Developer	SberDevices, AGI NLP

📞 Contacts

nikita.martynov.98@list.ru
