🚀 Parrot
Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate the training of NLU models. A paraphrase framework is more than just a paraphrasing model. For more details on the library and its usage, please refer to the GitHub page.
🚀 Quick Start
Get started with Parrot by following the steps below.
📦 Installation
pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
💻 Usage Examples
Basic Usage
from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")
'''
uncomment to get reproducible paraphrase generations
def random_state(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
random_state(1234)
'''
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)
phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]
for phrase in phrases:
    print("-"*100)
    print("Input_phrase: ", phrase)
    print("-"*100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases:
        print(para_phrase)
----------------------------------------------------------------------
Input_phrase: Can you recommed some upscale restaurants in Newyork?
----------------------------------------------------------------------
list some excellent restaurants to visit in new york city?
what upscale restaurants do you recommend in new york?
i want to try some upscale restaurants in new york?
recommend some upscale restaurants in newyork?
can you recommend some high end restaurants in newyork?
can you recommend some upscale restaurants in new york?
can you recommend some upscale restaurants in newyork?
----------------------------------------------------------------------
Input_phrase: What are the famous places we should not miss in Russia?
----------------------------------------------------------------------
what should we not miss when visiting russia?
recommend some of the best places to visit in russia?
list some of the best places to visit in russia?
can you list the top places to visit in russia?
show the places that we should not miss in russia?
list some famous places which we should not miss in russia?
Advanced Usage
para_phrases = parrot.augment(input_phrase=phrase,
                              diversity_ranker="levenshtein",
                              do_diverse=False,
                              max_return_phrases=10,
                              max_length=32,
                              adequacy_threshold=0.99,
                              fluency_threshold=0.90)
📚 Documentation
Why Parrot?
Hugging Face lists 12 paraphrase models; RapidAPI lists 7 freemium and commercial paraphrasers like QuillBot; Rasa has discussed an experimental paraphraser for augmenting text data; Sentence-transformers offers a paraphrase mining utility; and NLPAug offers word-level augmentation with PPDB (a multi-million paraphrase database). While these attempts at paraphrasing are great, there are still some gaps, and paraphrasing is NOT yet a mainstream option for text augmentation in building NLU models. Parrot is a humble attempt to fill some of these gaps.
What is a good paraphrase?
Almost all conditioned text generation models are validated on two factors: (1) whether the generated text conveys the same meaning as the original context (Adequacy), and (2) whether the text is fluent, grammatically correct English (Fluency). For instance, Neural Machine Translation outputs are tested for Adequacy and Fluency. A good paraphrase, however, should be adequate and fluent while being as different as possible from the original in its surface lexical form. Under this definition, the three key metrics that measure the quality of paraphrases are:
- Adequacy (Is the meaning preserved adequately?)
- Fluency (Is the paraphrase fluent English?)
- Diversity (Lexical / Phrasal / Syntactical) (How much has the paraphrase changed the original sentence?)
Parrot offers knobs to control Adequacy, Fluency and Diversity as per your needs.
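To make the Diversity metric concrete, here is a minimal sketch of ranking candidate paraphrases by character-level Levenshtein (edit) distance from the input. It is an illustration only, not Parrot's internal ranker: the function names `levenshtein` and `rank_by_diversity`, the lowercasing step, and the normalisation by the longer string's length are all assumptions made for this example.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def rank_by_diversity(source: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Sort candidates from most to least different from the source.

    The score is the edit distance divided by the longer string's length,
    so 0.0 means identical (ignoring case). This scoring is an assumption
    for illustration, not the library's own formula.
    """
    source_norm = source.lower()
    scored = []
    for candidate in candidates:
        longest = max(len(source_norm), len(candidate), 1)
        distance = levenshtein(source_norm, candidate.lower())
        scored.append((candidate, distance / longest))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    source = "Can you recommed some upscale restaurants in Newyork?"
    candidates = [
        "can you recommend some upscale restaurants in newyork?",
        "list some excellent restaurants to visit in new york city?",
        "what upscale restaurants do you recommend in new york?",
    ]
    for phrase, score in rank_by_diversity(source, candidates):
        print(f"{score:.2f}  {phrase}")
```

A surface-distance score like this says nothing about Adequacy or Fluency, which is why diversity is treated as a third, separate metric.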
What makes a paraphraser a good augmentor?
Training an NLU model requires not only a large number of utterances but also utterances annotated with intents and slots/entities. The typical flow would be:
1. Given an input utterance + input annotations, a good augmentor spits out N output paraphrases while preserving the intent and slots.
2. The output paraphrases are then converted into annotated data using the input annotations obtained in step 1.
3. The annotated data created from the output paraphrases then forms the training dataset for your NLU model.
However, as generative models, paraphrasers generally do not guarantee that slots/entities are preserved. The ability to generate high-quality paraphrases in a constrained manner, without sacrificing intents and slots for lexical dissimilarity, is therefore what makes a paraphraser a good augmentor. More on this in the following section.
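The augmentation flow described above can be sketched roughly as follows. This is a hypothetical post-processing step, not part of Parrot's API: the helpers `annotate` and `build_training_rows`, the intent name `find_places`, the slot name `country`, and the second example paraphrase are all made up for illustration. The sketch assumes slot values can be located by case-insensitive exact substring match, and drops any paraphrase that lost a slot.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]


def annotate(paraphrase: str, slots: Dict[str, str]) -> Optional[Dict[str, Span]]:
    """Return the character span of every slot value in the paraphrase,
    or None if any slot value is missing (the paraphrase lost a slot)."""
    lowered = paraphrase.lower()
    spans: Dict[str, Span] = {}
    for name, value in slots.items():
        start = lowered.find(value.lower())
        if start == -1:
            return None
        spans[name] = (start, start + len(value))
    return spans


def build_training_rows(intent: str,
                        slots: Dict[str, str],
                        paraphrases: List[str]) -> List[dict]:
    """Carry the input annotations over to each paraphrase that kept all slots."""
    rows = []
    for text in paraphrases:
        spans = annotate(text, slots)
        if spans is not None:
            rows.append({"text": text, "intent": intent, "slots": spans})
    return rows


if __name__ == "__main__":
    # Input utterance: "What are the famous places we should not miss in Russia?"
    rows = build_training_rows(
        intent="find_places",            # hypothetical intent label
        slots={"country": "Russia"},     # hypothetical slot annotation
        paraphrases=[
            "what should we not miss when visiting russia?",
            "what should we not miss on our trip?",  # made-up paraphrase that drops the slot
        ],
    )
    for row in rows:
        print(row)
```

Exact matching is deliberately strict: a slot value of "Newyork" would not match a paraphrase that rewrites it as "new york", so a real pipeline would likely need fuzzier slot alignment.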
Scope
In the realm of conversational engines, knowledge bots are those to which we ask questions like "when was the Berlin Wall torn down?", transactional bots are those to which we give commands like "Turn on the music please", and voice assistants can both answer questions and execute our commands. Parrot mainly focuses on augmenting texts entered or spoken into conversational interfaces for building robust NLU models. (Typically, people do not type or shout long paragraphs into conversational interfaces. Hence, the pre-trained model is trained on text samples with a maximum length of 32.)
While Parrot predominantly aims to be a text augmentor for building good NLU models, it can also be used as a pure-play paraphraser.