🚀 flan-t5-large-grammar-synthesis
A fine-tuned model based on google/flan-t5-large for grammar correction on an expanded JFLEG dataset, capable of single-shot grammar correction without changing correct semantics.
🚀 Quick Start
There's a Colab notebook that already has this basic version implemented (click the Open in Colab button).
After `pip install transformers`, run the following code:
```python
from transformers import pipeline

# load the grammar-correction model as a text2text pipeline
corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
✨ Features
- A fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) for grammar correction on an expanded JFLEG dataset.
- Capable of "single-shot grammar correction" on potentially grammatically incorrect text with many mistakes, without semantically changing grammatically correct text/information.
- Converted to ONNX and can be loaded/used with Hugging Face's `optimum` library.
📦 Installation
Install `transformers`:

```bash
pip install transformers
```

Install `optimum` for ONNX usage:

```bash
pip install optimum[onnxruntime]
```
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
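The `text2text-generation` pipeline returns a list of dictionaries; the corrected string itself is under the `generated_text` key:

```python
print(results[0]['generated_text'])
```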
Advanced Usage (Batch Inference)
For batch inference, see [this discussion thread](https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis/discussions/1) for details. Essentially, the dataset consists of several sentences at a time, so it is recommended to run inference in batches of 64-96 tokens (or 2-3 sentences split with regex), as sketched below.
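A minimal sketch of that approach, assuming a simple regex sentence split; the chunk size and `batch_size` below are illustrative values to tune, not values taken from the model card:

```python
import re
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

def correct_document(text, sentences_per_chunk=3, batch_size=8):
    # naive sentence split on ., !, or ? followed by whitespace (illustrative only)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # group sentences into chunks of a few sentences each
    chunks = [
        ' '.join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
    # pass the whole list so the pipeline can batch the chunks internally
    results = corrector(chunks, batch_size=batch_size)
    return ' '.join(r['generated_text'] for r in results)
```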
ONNX Usage
```python
from optimum.pipelines import pipeline

# use the repo name from this model card for the ONNX export
corrector_model_name = "pszemraj/flan-t5-large-grammar-synthesis"
corrector = pipeline(
    "text2text-generation", model=corrector_model_name, accelerator="ort"
)
```
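Once created, the ONNX-backed pipeline is called the same way as the standard one:

```python
results = corrector("i can has cheezburger")
print(results[0]["generated_text"])
```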
📚 Documentation
Model Description
The intent is to create a text2text language model that successfully completes "single-shot grammar correction" on potentially grammatically incorrect text that could have many mistakes, with the important qualifier of not semantically changing text/information that is already grammatically correct.
Use Cases
- Correcting highly error-prone LM outputs: such as audio transcription (ASR) or handwriting OCR. Depending on the model/system used, it may also be worth applying this after OCR on typed characters.
- Correcting text generated by text generation models: to make the text cohesive and remove obvious errors that break conversational immersion. For example, it can be used on the outputs of [this OPT 2.7B chatbot-esque model of myself](https://huggingface.co/pszemraj/opt-peter-2.7B).
- Fixing tortured phrases: correcting so-called tortured phrases that indicate text was generated by a language model. Note that some of these are not fixed, especially when they involve domain-specific terminology.
Limitations
- The dataset uses the cc-by-nc-sa-4.0 license, and the model uses the apache-2.0 license.
- This is still a work-in-progress. While it is probably useful for "single-shot grammar correction" in many cases, please check the outputs for correctness.
Citation Info
If you find this fine-tuned model useful in your work, please consider citing it:
```bibtex
@misc{peter_szemraj_2022,
    author    = { {Peter Szemraj} },
    title     = { flan-t5-large-grammar-synthesis (Revision d0b5ae2) },
    year      = 2022,
    url       = { https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis },
    doi       = { 10.57967/hf/0138 },
    publisher = { Hugging Face }
}
```
Information Table
| Property | Details |
|---|---|
| Model Type | A fine-tuned version of google/flan-t5-large for grammar correction |
| Training Data | An expanded version of the JFLEG dataset |
| License (Dataset) | cc-by-nc-sa-4.0 |
| License (Model) | apache-2.0 |
Tips
⚠️ Important Note
This model is still a work-in-progress. Please check the outputs for correctness.
💡 Usage Tip
For batch inference, it is recommended to run inference in batches of 64-96 tokens (or 2-3 sentences split with regex). It is also helpful to first check whether a given sentence needs grammar correction before running the text2text model; this can be done with BERT-type models fine-tuned on CoLA, such as `textattack/roberta-base-CoLA` (see the sketch below).
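A minimal sketch of that gating idea, assuming the classifier's `LABEL_1` output corresponds to "grammatically acceptable" (worth verifying for the specific checkpoint) and using an illustrative confidence threshold:

```python
from transformers import pipeline

# CoLA acceptability classifier used as a cheap pre-filter
checker = pipeline('text-classification', 'textattack/roberta-base-CoLA')
corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

def correct_if_needed(sentence, threshold=0.9):
    pred = checker(sentence)[0]
    # assumption: LABEL_1 == grammatically acceptable for this checkpoint
    if pred['label'] == 'LABEL_1' and pred['score'] >= threshold:
        return sentence  # looks acceptable, skip the heavier correction model
    return corrector(sentence)[0]['generated_text']

print(correct_if_needed('i can has cheezburger'))
```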