🚀 flan-t5-large-grammar-synthesis
A fine-tuned model based on google/flan-t5-large for grammar correction on an expanded JFLEG dataset, capable of single-shot grammar correction without changing correct semantics.
🚀 Quick Start
There's a Colab notebook that already has this basic version implemented (click the Open in Colab button).
After `pip install transformers`, run the following code:
```python
from transformers import pipeline

# load the grammar-correction model as a text2text pipeline
corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
✨ Features
- A fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) for grammar correction on an expanded JFLEG dataset.
- Capable of "single-shot grammar correction" on potentially grammatically incorrect text with many mistakes, without semantically changing grammatically correct text/information.
- Converted to ONNX and can be loaded/used with Hugging Face's `optimum` library.
📦 Installation
Install `transformers`:

```bash
pip install transformers
```

Install `optimum` for ONNX usage:

```bash
pip install optimum[onnxruntime]
```
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
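The `text2text-generation` pipeline returns a list of dictionaries; the corrected string itself is under the `generated_text` key:

```python
print(results[0]['generated_text'])
```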
Advanced Usage (Batch Inference)
For batch inference, see [this discussion thread](https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis/discussions/1) for details. Essentially, the dataset consists of several sentences at a time, so it is recommended to run inference in batches of 64-96 tokens (or 2-3 sentences split with regex), as sketched below.
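A minimal sketch of that approach, assuming a simple regex sentence split; the chunk size and `batch_size` below are illustrative values to tune, not values taken from the model card:

```python
import re
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

def correct_document(text, sentences_per_chunk=3, batch_size=8):
    # naive sentence split on ., !, or ? followed by whitespace (illustrative only)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # group sentences into chunks of a few sentences each
    chunks = [
        ' '.join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
    # pass the whole list so the pipeline can batch the chunks internally
    results = corrector(chunks, batch_size=batch_size)
    return ' '.join(r['generated_text'] for r in results)
```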
ONNX Usage
```python
from optimum.pipelines import pipeline

# use the repo name from this model card for the ONNX export
corrector_model_name = "pszemraj/flan-t5-large-grammar-synthesis"
corrector = pipeline(
    "text2text-generation", model=corrector_model_name, accelerator="ort"
)
```
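Once created, the ONNX-backed pipeline is called the same way as the standard one:

```python
results = corrector("i can has cheezburger")
print(results[0]["generated_text"])
```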
📚 Documentation
Model Description
The intent is to create a text2text language model that successfully completes "single-shot grammar correction" on potentially grammatically incorrect text that could have many mistakes, with the important qualifier of not semantically changing text/information that is already grammatically correct.
Use Cases
- Correcting highly error-prone LM outputs: such as audio transcription (ASR) or handwriting OCR. Depending on the model/system used, it may also be worth applying this after OCR on typed characters.
- Correcting text generated by text generation models: to make the text cohesive and remove obvious errors that break conversational immersion. For example, it can be used on the outputs of [this OPT 2.7B chatbot-esque model of myself](https://huggingface.co/pszemraj/opt-peter-2.7B).
- Fixing tortured phrases: correcting so-called tortured phrases that indicate text was generated by a language model. Note that some of these are not fixed, especially when they involve domain-specific terminology.
Limitations
- The dataset uses the cc-by-nc-sa-4.0 license, and the model uses the apache-2.0 license.
- This is still a work-in-progress. While it is probably useful for "single-shot grammar correction" in many cases, please check the outputs for correctness.
Citation Info
If you find this fine-tuned model useful in your work, please consider citing it:
```bibtex
@misc{peter_szemraj_2022,
    author    = { {Peter Szemraj} },
    title     = { flan-t5-large-grammar-synthesis (Revision d0b5ae2) },
    year      = 2022,
    url       = { https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis },
    doi       = { 10.57967/hf/0138 },
    publisher = { Hugging Face }
}
```
Information Table
| Property | Details |
|---|---|
| Model Type | A fine-tuned version of google/flan-t5-large for grammar correction |
| Training Data | An expanded version of the JFLEG dataset |
| License (Dataset) | cc-by-nc-sa-4.0 |
| License (Model) | apache-2.0 |
Tips
⚠️ Important Note
This model is still a work-in-progress. Please check the outputs for correctness.
💡 Usage Tip
For batch inference, it is recommended to run inference in batches of 64-96 tokens (or 2-3 sentences split with regex). It is also helpful to first check whether a given sentence needs grammar correction before running the text2text model; this can be done with BERT-type models fine-tuned on CoLA, such as `textattack/roberta-base-CoLA` (see the sketch below).
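A minimal sketch of that gating idea, assuming the classifier's `LABEL_1` output corresponds to "grammatically acceptable" (worth verifying for the specific checkpoint) and using an illustrative confidence threshold:

```python
from transformers import pipeline

# CoLA acceptability classifier used as a cheap pre-filter
checker = pipeline('text-classification', 'textattack/roberta-base-CoLA')
corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

def correct_if_needed(sentence, threshold=0.9):
    pred = checker(sentence)[0]
    # assumption: LABEL_1 == grammatically acceptable for this checkpoint
    if pred['label'] == 'LABEL_1' and pred['score'] >= threshold:
        return sentence  # looks acceptable, skip the heavier correction model
    return corrector(sentence)[0]['generated_text']

print(correct_if_needed('i can has cheezburger'))
```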