Sentence-Doctor
Sentence Doctor is a T5 model designed to correct errors in sentences. It supports English, German, and French text, aiming to enhance the quality of text data in NLP applications.

Quick Start
Sentence Doctor is a T5 model that corrects errors or mistakes in English, German, and French sentences; see the Usage Examples section below for complete code.
Features
Problem Solving
Many NLP pipelines depend on upstream components such as text extraction libraries, OCR, speech-to-text systems, and sentence boundary detection. Errors introduced by these components propagate through the pipeline and can degrade model quality, especially since downstream models are usually trained on clean input.
Solution Approach
This model attempts to reconstruct sentences based on their context (surrounding text). The task is straightforward: given an "erroneous" sentence and its context, reconstruct the "intended" sentence.
Use Cases
- Repair noisy sentences extracted by OCR software or text extractors.
- Fix sentence boundaries. For example, in German (the corresponding model input is sketched after this list):
- Input: "und ich bin im"
- Prefix_Context: "Hallo! Mein Name ist John"
- Postfix_Context: "Januar 1990 geboren."
- Output: "John und ich bin im Jahr 1990 geboren"
- Potentially perform sentence-level spelling correction, although this is not the primary use.
- Input: "I went to church las yesteday" => Output: "I went to church last Sunday".
Installation
The model is used through the Hugging Face `transformers` library; a typical setup is `pip install transformers torch sentencepiece` (sentencepiece is needed for the T5 tokenizer). The model weights are downloaded automatically from the Hub on first use.
Usage Examples
Basic Usage
Preprocessing
text = "That is my job I am a medical doctor I save lives"
sentences = ["That is my job I a", "m a medical doct", "I save lives"]
input_text = "repair_sentence: " + sentences[1] + " context: {" + sentences[0] + "}{" + sentences[2] + "} </s>"
print(input_text)
The context is optional, so the input could also be `repair_sentence: m a medical doct context: {}{} </s>`.
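Since the context is optional, a small helper like the following (not part of the released code; the function name is only illustrative) can build the input string with or without context:

```python
def build_input(sentence: str, prefix: str = "", postfix: str = "") -> str:
    """Build a repair_sentence input string; the prefix/postfix context may be empty."""
    return f"repair_sentence: {sentence} context: {{{prefix}}}{{{postfix}}} </s>"

print(build_input("m a medical doct", "That is my job I a", "or I save lives"))
print(build_input("m a medical doct"))  # no context: ... context: {}{} </s>
```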
Inference
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

# Note: newer transformers releases deprecate AutoModelWithLMHead;
# AutoModelForSeq2SeqLM can be used the same way here.
tokenizer = AutoTokenizer.from_pretrained("flexudy/t5-base-multi-sentence-doctor")
model = AutoModelWithLMHead.from_pretrained("flexudy/t5-base-multi-sentence-doctor")

input_text = "repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

assert sentence == "I am a medical doctor."
```
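Reusing the tokenizer and model loaded above, the same pipeline can be applied to the German sentence-boundary example from the Use Cases section (the generation settings are unchanged; the exact output may differ slightly from the one quoted there):

```python
# German sentence-boundary example from the Use Cases section.
input_text = ("repair_sentence: und ich bin im "
              "context: {Hallo! Mein Name ist John}{Januar 1990 geboren.} </s>")

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)

print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```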
Advanced Usage
Fine-tuning
We provide a script, `train_any_t5_task.py`, to help you fine-tune any text2text task with T5. You can set parameters as follows:

```python
config.TRAIN_EPOCHS = 3
```

If you don't want to read the #TODO comments, just pass your data in like this:

```python
trainer.start("data/sentence_doctor_dataset_300.csv")
```
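Each training example pairs a noisy input string (in the same `repair_sentence: ... context: {...}{...}` format used above) with the clean target sentence. As a rough illustration of one such pair (the exact column layout of the provided CSV files is not shown here, so treat this only as a sketch):

```python
# Hypothetical training pair in the input/target format described above.
training_input = ("repair_sentence: m a medical doct "
                  "context: {That is my job I a}{or I save lives} </s>")
training_target = "I am a medical doctor."
```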
Documentation
Disclaimer
Note that we always emphasize the word *attempt*. The current version of the model was trained on only 150K sentences from the Tatoeba dataset (https://tatoeba.org/eng), 50K per language (En, Fr, De). Hence, we strongly encourage you to fine-tune the model on your own dataset. We might release a version trained on more data.
Datasets
We generated synthetic data from the Tatoeba dataset (https://tatoeba.org/eng) by randomly applying different transformations to words and characters with certain probabilities. The datasets are available in the data folder (where sentence_doctor_dataset_300K is the larger dataset, with 100K sentences per language).
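The exact generation script is not reproduced here, but the idea of randomly transforming words and characters can be sketched as follows (the operations and probabilities below are purely illustrative, not the ones actually used to build the dataset):

```python
import random

def corrupt(sentence: str, p_char: float = 0.05, p_word: float = 0.1) -> str:
    """Illustrative noise model: randomly drop whole words and individual characters."""
    noisy_words = []
    for word in sentence.split():
        if random.random() < p_word:
            continue  # drop the whole word
        kept = "".join(c for c in word if random.random() >= p_char)  # drop some characters
        noisy_words.append(kept)
    return " ".join(noisy_words)

random.seed(0)
print(corrupt("That is my job. I am a medical doctor. I save lives."))
```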
Technical Details
The model is based on the T5 architecture. It was fine-tuned from the Hugging Face Hub model WikinewsSum/t5-base-multi-combine-wiki-news.
License
No license information is provided for this model.
Attribution
- The Hugging Face transformers library for making this possible.
- Abhishek Kumar Mishra's transformer [tutorial](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) on text summarization. Our training code is a modified version of their code.
- We fine-tuned this model from the Hugging Face Hub model WikinewsSum/t5-base-multi-combine-wiki-news. Thanks to the authors.
- We also referred to a lot of work from [Suraj Patil](https://github.com/patil-suraj).