Sentence-Doctor
Sentence Doctor is a T5 model designed to correct errors in sentences. It supports English, German, and French text, aiming to enhance the quality of text data in NLP applications.

Quick Start
Sentence Doctor is a T5 model that corrects errors or mistakes in English, German, and French sentences; see the Usage Examples section below for complete code.
Features
Problem Solving
Many NLP pipelines depend on upstream components such as text extraction libraries, OCR, speech-to-text systems, and sentence boundary detection. Errors introduced by these components propagate through the pipeline and can degrade model quality, especially since downstream models are usually trained on clean input.
Solution Approach
This model attempts to reconstruct sentences based on their context (surrounding text). The task is straightforward: given an "erroneous" sentence and its context, reconstruct the "intended" sentence.
Use Cases
- Repair noisy sentences extracted by OCR software or text extractors.
- Fix sentence boundaries. For example, in German (the corresponding model input is sketched after this list):
- Input: "und ich bin im"
- Prefix_Context: "Hallo! Mein Name ist John"
- Postfix_Context: "Januar 1990 geboren."
- Output: "John und ich bin im Jahr 1990 geboren"
- Potentially perform sentence-level spelling correction, although this is not the primary use.
- Input: "I went to church las yesteday" => Output: "I went to church last Sunday".
Installation
The model is used through the Hugging Face `transformers` library; a typical setup is `pip install transformers torch sentencepiece` (sentencepiece is needed for the T5 tokenizer). The model weights are downloaded automatically from the Hub on first use.
Usage Examples
Basic Usage
Preprocessing
text = "That is my job I am a medical doctor I save lives"
sentences = ["That is my job I a", "m a medical doct", "I save lives"]
input_text = "repair_sentence: " + sentences[1] + " context: {" + sentences[0] + "}{" + sentences[2] + "} </s>"
print(input_text)
The context is optional, so the input could also be `repair_sentence: m a medical doct context: {}{} </s>`.
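Since the context is optional, a small helper like the following (not part of the released code; the function name is only illustrative) can build the input string with or without context:

```python
def build_input(sentence: str, prefix: str = "", postfix: str = "") -> str:
    """Build a repair_sentence input string; the prefix/postfix context may be empty."""
    return f"repair_sentence: {sentence} context: {{{prefix}}}{{{postfix}}} </s>"

print(build_input("m a medical doct", "That is my job I a", "or I save lives"))
print(build_input("m a medical doct"))  # no context: ... context: {}{} </s>
```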
Inference
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

# Note: newer transformers releases deprecate AutoModelWithLMHead;
# AutoModelForSeq2SeqLM can be used the same way here.
tokenizer = AutoTokenizer.from_pretrained("flexudy/t5-base-multi-sentence-doctor")
model = AutoModelWithLMHead.from_pretrained("flexudy/t5-base-multi-sentence-doctor")

input_text = "repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

assert sentence == "I am a medical doctor."
```
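Reusing the tokenizer and model loaded above, the same pipeline can be applied to the German sentence-boundary example from the Use Cases section (the generation settings are unchanged; the exact output may differ slightly from the one quoted there):

```python
# German sentence-boundary example from the Use Cases section.
input_text = ("repair_sentence: und ich bin im "
              "context: {Hallo! Mein Name ist John}{Januar 1990 geboren.} </s>")

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```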
Advanced Usage
Fine-tuning
We provide a script, `train_any_t5_task.py`, to help you fine-tune any text2text task with T5. You can set parameters as follows:

```python
config.TRAIN_EPOCHS = 3
```

If you don't want to read the #TODO comments, just pass your data in like this:

```python
trainer.start("data/sentence_doctor_dataset_300.csv")
```
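Each training example pairs a noisy input string (in the same `repair_sentence: ... context: {...}{...}` format used above) with the clean target sentence. As a rough illustration of one such pair (the exact column layout of the provided CSV files is not shown here, so treat this only as a sketch):

```python
# Hypothetical training pair in the input/target format described above.
training_input = ("repair_sentence: m a medical doct "
                  "context: {That is my job I a}{or I save lives} </s>")
training_target = "I am a medical doctor."
```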
Documentation
Disclaimer
Note that we always emphasize the word *attempt*. The current version of the model was trained on only 150K sentences from the Tatoeba dataset (https://tatoeba.org/eng), 50K per language (En, Fr, De). Hence, we strongly encourage you to fine-tune the model on your own dataset. We might release a version trained on more data.
Datasets
We generated synthetic data from the Tatoeba dataset (https://tatoeba.org/eng) by randomly applying different transformations to words and characters with certain probabilities. The datasets are available in the data folder (where sentence_doctor_dataset_300K is the larger dataset, with 100K sentences per language).
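The exact generation script is not reproduced here, but the idea of randomly transforming words and characters can be sketched as follows (the operations and probabilities below are purely illustrative, not the ones actually used to build the dataset):

```python
import random

def corrupt(sentence: str, p_char: float = 0.05, p_word: float = 0.1) -> str:
    """Illustrative noise model: randomly drop whole words and individual characters."""
    noisy_words = []
    for word in sentence.split():
        if random.random() < p_word:
            continue  # drop the whole word
        kept = "".join(c for c in word if random.random() >= p_char)  # drop some characters
        noisy_words.append(kept)
    return " ".join(noisy_words)

random.seed(0)
print(corrupt("That is my job. I am a medical doctor. I save lives."))
```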
Technical Details
The model is based on the T5 architecture. It was fine-tuned from the Hugging Face Hub model WikinewsSum/t5-base-multi-combine-wiki-news.
License
No license information is provided for this model.
Attribution
- The Hugging Face transformers library for making this possible.
- Abhishek Kumar Mishra's transformer [tutorial](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) on text summarization. Our training code is a modified version of their code.
- We fine-tuned this model from the Hugging Face Hub model WikinewsSum/t5-base-multi-combine-wiki-news. Thanks to the authors.
- We also referred to a lot of work from [Suraj Patil](https://github.com/patil-suraj).