Model Card for Transformer Paraphraser
This model card presents a lightweight paraphraser. It provides a lower-cost alternative for paraphrasing tasks, built on the google/t5-efficient-large-nl32 checkpoint and finetuned on 100,000 non-context paraphrase datapoints.
🚀 Quick Start
Use the following code to start using the model:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The tokenizer comes from the base T5 checkpoint; the weights are the finetuned paraphraser.
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "Each Wednesday, I take my dog for a walk in Central Park."

# Paraphrasing strength controls, each taking values in {0, 20, 40, 60, 80, 100}.
lexical = 20
order = 40
prompt = f"lexical = {lexical}, order = {order} {text}"

input_ids = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

outputs = model.generate(
    **input_ids,
    top_p=0.75,
    do_sample=True,
    max_new_tokens=300,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)
print(response)
✨ Features
- Lightweight: Built for lower-cost usage, making it accessible for a wide range of applications.
- Non-context paraphrasing: Paraphrases a passage on its own, without relying on surrounding context, which is useful in specific scenarios.
- Controllable paraphrasing: Allows users to control the degree of lexical and order changes in the paraphrase.
📦 Installation
Since this is a 🤗 transformers model, you can install the necessary dependencies using the following command:
pip install transformers torch
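Depending on your environment, the T5-based tokenizer may also need the sentencepiece package. This is an assumption about your setup rather than a requirement stated in the original card:
pip install sentencepiece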
💻 Usage Examples
Basic Usage
The basic usage is shown in the quick start code above. You can adjust the lexical and order parameters to control the paraphrasing strength.
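To make the strength settings easy to compare, the quick-start code can be wrapped in a small helper. This is a minimal sketch that assumes the tokenizer, model, and device from the quick start are already loaded; the paraphrase function name is illustrative and not part of the model's API.

def paraphrase(text, lexical=20, order=40):
    # lexical and order take values in {0, 20, 40, 60, 80, 100} and set the paraphrasing strength.
    prompt = f"lexical = {lexical}, order = {order} {text}"
    inputs = tokenizer(prompt, return_tensors="pt", max_length=1000, truncation=True).to(device)
    outputs = model.generate(**inputs, top_p=0.75, do_sample=True, max_new_tokens=300)
    return " ".join(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# Two different strength settings on the same sentence.
print(paraphrase("Each Wednesday, I take my dog for a walk in Central Park.", lexical=20, order=0))
print(paraphrase("Each Wednesday, I take my dog for a walk in Central Park.", lexical=80, order=80))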
Advanced Usage
You can further experiment with different generation parameters such as top_p, do_sample, and max_new_tokens to get different paraphrasing results. For example:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "The sun rises in the east."
lexical = 60
order = 80
prompt = f"lexical = {lexical}, order = {order} {text}"

input_ids = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

outputs = model.generate(
    **input_ids,
    top_p=0.9,
    do_sample=True,
    max_new_tokens=500,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)
print(response)
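Because this is the non-context variant, each sentence can be paraphrased independently of its neighbours. The sketch below splits a longer passage into sentences, paraphrases each one on its own, and rejoins them; the naive period-based splitting and the paraphrase helper from Basic Usage are my own assumptions, not part of the original card.

long_text = (
    "The sun rises in the east. It sets in the west. "
    "Sailors have relied on this for navigation for centuries."
)

# Naive sentence splitting; a proper sentence tokenizer would be more robust.
sentences = [s.strip() + "." for s in long_text.split(".") if s.strip()]

# Paraphrase each sentence independently, then rejoin the passage.
paraphrased = " ".join(paraphrase(s, lexical=60, order=40) for s in sentences)
print(paraphrased)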
📚 Documentation
Model Details
Model Description
This is a 🤗 transformers model card for a paraphraser. The model is based on google/t5-efficient-large-nl32 and finetuned on 100,000 non-context datapoints.
- Developed by: Sam Jackson
- Model type: Sequence-to-Sequence Model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: google/t5-efficient-large-nl32
Model Sources
- Repository: Original GitHub
- Paper: Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
Uses
The model is intended for paraphrasing with control over lexical and order changes. The training data is formatted so that adjusting these parameters controls the strength of the paraphrase.
Direct Use
The model can be used directly without further finetuning, although it is possible to finetune it if needed.
Downstream Use
Since this model is finetuned from a T5 checkpoint, it can be further finetuned for specific tasks. If you plan on transfer learning, it is recommended to start from the initial checkpoint model: google/t5-efficient-large-nl32.
Recommendations
If you have the capacity, it is recommended to use the more powerful model: DIPPER. Otherwise, this model is sufficiently strong and outperforms the sentence-based ChatGPT Paraphraser in terms of perplexity scores when compared using the facebook/opt-2.7b model.
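The comparison above is in terms of perplexity under facebook/opt-2.7b. The exact evaluation script is not included in this card; the following is only a rough sketch of how a per-output perplexity could be computed, reusing the device from the quick start.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ppl_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
ppl_model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b").to(device)
ppl_model.eval()

def perplexity(text):
    # Perplexity = exp(mean token-level negative log-likelihood under OPT-2.7b).
    enc = ppl_tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = ppl_model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Lower perplexity suggests a more fluent paraphrase.
print(perplexity("Every Wednesday I walk my dog through Central Park."))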
Training Details
Training Data
The training data is available at kpar3-no-ctx. Pre-processing involves tokenization with the google/t5-efficient-large-nl32 tokenizer. The data consists of classic paraphrase pairs, where the first element in each pair carries "lexical = x" and "order = y" terms; x and y take values in the set {0, 20, 40, 60, 80, 100} and denote the paraphrasing strength.
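For illustration, a single datapoint has roughly the following shape. The pair below is invented rather than taken from kpar3-no-ctx, and the field names are only indicative.

# Control terms are prefixed to the source text; the target is its paraphrase.
datapoint = {
    "input": "lexical = 40, order = 20 Each Wednesday, I take my dog for a walk in Central Park.",
    "target": "I take my dog for a walk in Central Park every Wednesday.",
}
# Both strength values are drawn from {0, 20, 40, 60, 80, 100}.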
Training Hyperparameters
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
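These settings map directly onto 🤗 transformers Seq2SeqTrainingArguments. A sketch is given below; the output directory and any arguments beyond those listed above are assumptions, since the full training script is not part of this card.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="paraphrase-dipper-no-ctx",  # assumed; not necessarily the original value
    learning_rate=1e-4,
    bf16=True,
    num_train_epochs=2,
    auto_find_batch_size=True,
    predict_with_generate=True,  # needed for the generation_* settings to take effect
    generation_num_beams=2,
    generation_max_length=200,
)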
Speeds, Sizes, Times
Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.
🔧 Technical Details
The model is built from google/t5-efficient-large-nl32 and finetuned on a non-context paraphrase dataset. The finetuning process uses the hyperparameters listed above to achieve the desired paraphrasing performance, and the dataset's "lexical" and "order" parameters allow for controllable paraphrasing.
📄 License
This model is released under the MIT license.
Citation
BibTeX:
@misc{krishna2023paraphrasing,
    title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense},
    author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer},
    year={2023},
    eprint={2303.13408},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model Card Contact
Contact me through Hugging Face if you have any questions.






