# Targeted Paraphrasing Model for Adversarial Data Generation
This repository offers the (UN)-Targeted Paraphrasing Model, developed as part of the research presented in the paper:
"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion."
The model is designed to generate high-quality paraphrases with improved fluency, diversity, and relevance, and is tailored specifically for adversarial data generation.
## Documentation
### Paraphrasing Datasets
The training process employed a carefully curated dataset of 560,550 paraphrase pairs drawn from seven high-quality sources:
- APT Dataset (Nighojkar and Licato, 2021)
- Microsoft Research Paraphrase Corpus (MSRP) (Dolan and Brockett, 2005)
- PARANMT-50M (Wieting and Gimpel, 2018)
- TwitterPPDB (Lan et al., 2017)
- PIT-2015 (Xu et al., 2015)
- PARADE (He et al., 2020)
- Quora Question Pairs (QQP) (Iyer et al., 2017)
Filtering steps were taken to ensure high-quality and diverse data (a code sketch of these filters appears below):
- Pairs with over 50% unigram overlap were removed to enhance lexical diversity.
- Pairs with less than 50% reordering of shared words were eliminated for syntactic diversity.
- Pairs with less than 50% semantic similarity were filtered out, using cosine similarity scores from the "all-MiniLM-L12-v2" model.
- Pairs with over 70% trigram overlap were discarded to improve diversity.
The refined dataset contains 96,073 samples, split into training (76,857), validation (9,608), and testing (9,608) subsets.
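For illustration, the four filters above can be approximated in a few lines of Python. This is a minimal sketch, not the paper's implementation: the helper names are invented, the reordering measure is a rough proxy, and the similarity check assumes the sentence-transformers package.

```python
# Sketch of the four filtering heuristics described above.
# Thresholds follow the text; helper names are illustrative, not from the paper.
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("all-MiniLM-L12-v2")

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b, n=1):
    """Fraction of source n-grams that also appear in the paraphrase."""
    src, tgt = ngrams(a, n), ngrams(b, n)
    return len(src & tgt) / max(len(src), 1)

def reordering(a, b):
    """Rough proxy: fraction of shared words whose relative order changes."""
    shared = [w for w in a if w in set(b)]
    moved = sum(1 for i, w in enumerate(shared) if b.index(w) != i)
    return moved / max(len(shared), 1)

def keep_pair(src, tgt):
    a, b = src.lower().split(), tgt.lower().split()
    if overlap(a, b, n=1) > 0.5:   # drop pairs with >50% unigram overlap
        return False
    if reordering(a, b) < 0.5:     # drop pairs with <50% reordering
        return False
    if overlap(a, b, n=3) > 0.7:   # drop pairs with >70% trigram overlap
        return False
    emb = sim_model.encode([src, tgt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= 0.5  # keep if semantically close
```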
### Model Description
The paraphrasing model is based on FLAN-T5-large and fine-tuned on the filtered dataset for nine epochs. Key features include:
- Performance: It achieves an F1 BERTScore of 75.925%, indicating strong fluency and paraphrasing quality (see the evaluation sketch below).
- Task-Specificity: Focused training on relevant pairs yields high-quality, task-specific outputs.
- Enhanced Generation: It generates paraphrases that introduce new information about entities or objects, improving overall generation quality.
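For reference, BERTScore can be computed with the bert-score package. This is a minimal sketch with made-up example sentences; it does not reproduce the paper's exact evaluation settings:

```python
# Minimal sketch: scoring model outputs against reference paraphrases
# with the bert-score package; sentences here are illustrative only.
from bert_score import score

candidates = ["How is it going?"]      # model outputs (illustrative)
references = ["How are you doing?"]    # gold paraphrases (illustrative)

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"F1 BERTScore: {F1.mean().item():.3f}")
```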
### Applications
This model is mainly designed to create adversarial training samples that can effectively uncover edge cases in machine learning models while maintaining minimal distribution distortion.
Additionally, the model is suitable for general paraphrasing purposes, making it a versatile tool for generating high-quality paraphrases in various contexts. It is compatible with the Parrot paraphrasing library for seamless integration and usage.
## Installation
To install the Parrot library, run:

```bash
pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
```
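The basic example below only needs the transformers library and a backend such as PyTorch, which can be installed separately (assuming a standard pip environment):

```bash
pip install transformers torch
```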
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned paraphraser from the Hugging Face Hub
model_name = "alykassem/FLAN-T5-Paraphraser"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prompt the model with the "Paraphrase: " prefix
input_text = "Paraphrase: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text:", decoded_output)
```
### Advanced Usage
```python
from parrot import Parrot
import torch
import warnings

warnings.filterwarnings("ignore")

# Point Parrot at this model instead of its default paraphraser
parrot = Parrot(model_tag="alykassem/FLAN-T5-Paraphraser", use_gpu=True)

phrases = [
    "Can you recommend some upscale restaurants in New York?",
    "What are the famous places we should not miss in Russia?",
]

for phrase in phrases:
    print("-" * 100)
    print("Input Phrase: ", phrase)
    print("-" * 100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases:
        print(para_phrase)
```
## Citation
If you find this work or model useful, please cite the paper:
```bibtex
@inproceedings{kassem-saad-2024-finding,
    title = "Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion",
    author = "Kassem, Aly and Saad, Sherif",
    editor = "Graham, Yvette and Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.33/",
    pages = "552--572",
}
```
## License

This project is licensed under the Apache-2.0 license.
## Metadata
| Property | Details |
|----------|---------|
| Base Model | google/flan-t5-large |
| Pipeline Tag | text2text-generation |
| Metrics | bertscore |