# Targeted Paraphrasing Model for Adversarial Data Generation
This repository offers the (UN)-Targeted Paraphrasing Model, developed as part of the research presented in the paper:
"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion."
The model is designed to generate high-quality paraphrases with improved fluency, diversity, and relevance, and is tailored specifically for adversarial data generation.
## Documentation
### Paraphrasing Datasets
The training process employed a carefully curated dataset of 560,550 paraphrase pairs drawn from seven high-quality sources:
- APT Dataset (Nighojkar and Licato, 2021)
- Microsoft Research Paraphrase Corpus (MSRP) (Dolan and Brockett, 2005)
- PARANMT-50M (Wieting and Gimpel, 2018)
- TwitterPPDB (Lan et al., 2017)
- PIT-2015 (Xu et al., 2015)
- PARADE (He et al., 2020)
- Quora Question Pairs (QQP) (Iyer et al., 2017)
Filtering steps were taken to ensure high-quality and diverse data (a code sketch of these filters appears below):
- Pairs with over 50% unigram overlap were removed to enhance lexical diversity.
- Pairs with less than 50% reordering of shared words were eliminated for syntactic diversity.
- Pairs with less than 50% semantic similarity were filtered out, using cosine similarity scores from the "all-MiniLM-L12-v2" model.
- Pairs with over 70% trigram overlap were discarded to improve diversity.
The refined dataset contains 96,073 samples, split into training (76,857), validation (9,608), and testing (9,608) subsets.
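For illustration, the four filters above can be approximated in a few lines of Python. This is a minimal sketch, not the paper's implementation: the helper names are invented, the reordering measure is a rough proxy, and the similarity check assumes the sentence-transformers package.

```python
# Sketch of the four filtering heuristics described above.
# Thresholds follow the text; helper names are illustrative, not from the paper.
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("all-MiniLM-L12-v2")

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b, n=1):
    """Fraction of source n-grams that also appear in the paraphrase."""
    src, tgt = ngrams(a, n), ngrams(b, n)
    return len(src & tgt) / max(len(src), 1)

def reordering(a, b):
    """Rough proxy: fraction of shared words whose relative order changes."""
    shared = [w for w in a if w in set(b)]
    moved = sum(1 for i, w in enumerate(shared) if b.index(w) != i)
    return moved / max(len(shared), 1)

def keep_pair(src, tgt):
    a, b = src.lower().split(), tgt.lower().split()
    if overlap(a, b, n=1) > 0.5:   # drop pairs with >50% unigram overlap
        return False
    if reordering(a, b) < 0.5:     # drop pairs with <50% reordering
        return False
    if overlap(a, b, n=3) > 0.7:   # drop pairs with >70% trigram overlap
        return False
    emb = sim_model.encode([src, tgt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= 0.5  # keep if semantically close
```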
### Model Description
The paraphrasing model is based on FLAN-T5-large and fine-tuned on the filtered dataset for nine epochs. Key features include:
- Performance: It achieves an F1 BERTScore of 75.925%, indicating strong fluency and paraphrasing quality (see the evaluation sketch below).
- Task-Specificity: Focused training on relevant pairs yields high-quality, task-specific outputs.
- Enhanced Generation: It generates paraphrases that introduce new information about entities or objects, improving overall generation quality.
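For reference, BERTScore can be computed with the bert-score package. This is a minimal sketch with made-up example sentences; it does not reproduce the paper's exact evaluation settings:

```python
# Minimal sketch: scoring model outputs against reference paraphrases
# with the bert-score package; sentences here are illustrative only.
from bert_score import score

candidates = ["How is it going?"]      # model outputs (illustrative)
references = ["How are you doing?"]    # gold paraphrases (illustrative)

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"F1 BERTScore: {F1.mean().item():.3f}")
```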
### Applications
This model is mainly designed to create adversarial training samples that can effectively uncover edge cases in machine learning models while maintaining minimal distribution distortion.
Additionally, the model is suitable for general paraphrasing purposes, making it a versatile tool for generating high-quality paraphrases in various contexts. It is compatible with the Parrot paraphrasing library for seamless integration and usage.
## Installation
To install the Parrot library, run:

```bash
pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
```
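The basic example below only needs the transformers library and a backend such as PyTorch, which can be installed separately (assuming a standard pip environment):

```bash
pip install transformers torch
```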
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned paraphraser from the Hugging Face Hub
model_name = "alykassem/FLAN-T5-Paraphraser"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prompt the model with the "Paraphrase: " prefix
input_text = "Paraphrase: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text:", decoded_output)
```
### Advanced Usage
```python
from parrot import Parrot
import torch
import warnings

warnings.filterwarnings("ignore")

# Point Parrot at this model instead of its default paraphraser
parrot = Parrot(model_tag="alykassem/FLAN-T5-Paraphraser", use_gpu=True)

phrases = [
    "Can you recommend some upscale restaurants in New York?",
    "What are the famous places we should not miss in Russia?",
]

for phrase in phrases:
    print("-" * 100)
    print("Input Phrase: ", phrase)
    print("-" * 100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases:
        print(para_phrase)
```
## Citation
If you find this work or model useful, please cite the paper:
```bibtex
@inproceedings{kassem-saad-2024-finding,
    title = "Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion",
    author = "Kassem, Aly and Saad, Sherif",
    editor = "Graham, Yvette and Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.33/",
    pages = "552--572",
}
```
## License

This project is licensed under the Apache-2.0 license.
## Metadata
| Property | Details |
|----------|---------|
| Base Model | google/flan-t5-large |
| Pipeline Tag | text2text-generation |
| Metrics | bertscore |