Model Card for Transformer Paraphraser
This model card presents a lightweight paraphraser. It provides a lower-cost alternative for paraphrasing tasks, built on the google/t5-efficient-large-nl32 checkpoint and finetuned on 100,000 non-context paraphrase datapoints.
🚀 Quick Start
Use the following code to start using the model:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The tokenizer comes from the base T5 checkpoint; the weights are the finetuned paraphraser.
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "Each Wednesday, I take my dog for a walk in Central Park."

# Paraphrasing strength controls, each taking values in {0, 20, 40, 60, 80, 100}.
lexical = 20
order = 40
prompt = f"lexical = {lexical}, order = {order} {text}"

input_ids = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

outputs = model.generate(
    **input_ids,
    top_p=0.75,
    do_sample=True,
    max_new_tokens=300,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)
print(response)
✨ Features
- Lightweight: Built for lower-cost usage, making it accessible for a wide range of applications.
- Non-context paraphrasing: Paraphrases a passage on its own, without relying on surrounding context, which is useful in specific scenarios.
- Controllable paraphrasing: Allows users to control the degree of lexical and order changes in the paraphrase.
📦 Installation
Since this is a 🤗 transformers model, you can install the necessary dependencies using the following command:
pip install transformers torch
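Depending on your environment, the T5-based tokenizer may also need the sentencepiece package. This is an assumption about your setup rather than a requirement stated in the original card:
pip install sentencepiece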
💻 Usage Examples
Basic Usage
The basic usage is shown in the quick start code above. You can adjust the lexical and order parameters to control the paraphrasing strength.
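To make the strength settings easy to compare, the quick-start code can be wrapped in a small helper. This is a minimal sketch that assumes the tokenizer, model, and device from the quick start are already loaded; the paraphrase function name is illustrative and not part of the model's API.

def paraphrase(text, lexical=20, order=40):
    # lexical and order take values in {0, 20, 40, 60, 80, 100} and set the paraphrasing strength.
    prompt = f"lexical = {lexical}, order = {order} {text}"
    inputs = tokenizer(prompt, return_tensors="pt", max_length=1000, truncation=True).to(device)
    outputs = model.generate(**inputs, top_p=0.75, do_sample=True, max_new_tokens=300)
    return " ".join(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# Two different strength settings on the same sentence.
print(paraphrase("Each Wednesday, I take my dog for a walk in Central Park.", lexical=20, order=0))
print(paraphrase("Each Wednesday, I take my dog for a walk in Central Park.", lexical=80, order=80))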
Advanced Usage
You can further experiment with different generation parameters such as top_p, do_sample, and max_new_tokens to get different paraphrasing results. For example:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "The sun rises in the east."
lexical = 60
order = 80
prompt = f"lexical = {lexical}, order = {order} {text}"

input_ids = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

outputs = model.generate(
    **input_ids,
    top_p=0.9,
    do_sample=True,
    max_new_tokens=500,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)
print(response)
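Because this is the non-context variant, each sentence can be paraphrased independently of its neighbours. The sketch below splits a longer passage into sentences, paraphrases each one on its own, and rejoins them; the naive period-based splitting and the paraphrase helper from Basic Usage are my own assumptions, not part of the original card.

long_text = (
    "The sun rises in the east. It sets in the west. "
    "Sailors have relied on this for navigation for centuries."
)

# Naive sentence splitting; a proper sentence tokenizer would be more robust.
sentences = [s.strip() + "." for s in long_text.split(".") if s.strip()]

# Paraphrase each sentence independently, then rejoin the passage.
paraphrased = " ".join(paraphrase(s, lexical=60, order=40) for s in sentences)
print(paraphrased)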
📚 Documentation
Model Details
Model Description
This is a 🤗 transformers model card for a paraphraser. The model is based on google/t5-efficient-large-nl32 and finetuned on 100,000 non-context datapoints.
- Developed by: Sam Jackson
- Model type: Sequence-to-Sequence Model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: google/t5-efficient-large-nl32
Model Sources
- Repository: Original GitHub
- Paper: Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
Uses
The model is intended for paraphrasing with control over lexical and order changes. The training data is formatted so that adjusting these parameters controls the strength of the paraphrase.
Direct Use
The model can be used directly without further finetuning, although it is possible to finetune it if needed.
Downstream Use
Since this model is finetuned from a T5 checkpoint, it can be further finetuned for specific tasks. If you plan on transfer learning, it is recommended to start from the initial checkpoint model: google/t5-efficient-large-nl32.
Recommendations
If you have the capacity, it is recommended to use the more powerful model: DIPPER. Otherwise, this model is sufficiently strong and outperforms the sentence-based ChatGPT Paraphraser in terms of perplexity scores when compared using the facebook/opt-2.7b model.
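The comparison above is in terms of perplexity under facebook/opt-2.7b. The exact evaluation script is not included in this card; the following is only a rough sketch of how a per-output perplexity could be computed, reusing the device from the quick start.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ppl_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
ppl_model = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b").to(device)
ppl_model.eval()

def perplexity(text):
    # Perplexity = exp(mean token-level negative log-likelihood under OPT-2.7b).
    enc = ppl_tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = ppl_model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Lower perplexity suggests a more fluent paraphrase.
print(perplexity("Every Wednesday I walk my dog through Central Park."))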
Training Details
Training Data
The training data is available at kpar3-no-ctx. Pre-processing involves tokenization with the google/t5-efficient-large-nl32 tokenizer. The data consists of classic paraphrase pairs, where the first element in each pair carries "lexical = x" and "order = y" terms; x and y take values in the set {0, 20, 40, 60, 80, 100} and denote the paraphrasing strength.
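For illustration, a single datapoint has roughly the following shape. The pair below is invented rather than taken from kpar3-no-ctx, and the field names are only indicative.

# Control terms are prefixed to the source text; the target is its paraphrase.
datapoint = {
    "input": "lexical = 40, order = 20 Each Wednesday, I take my dog for a walk in Central Park.",
    "target": "I take my dog for a walk in Central Park every Wednesday.",
}
# Both strength values are drawn from {0, 20, 40, 60, 80, 100}.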
Training Hyperparameters
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
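These settings map directly onto 🤗 transformers Seq2SeqTrainingArguments. A sketch is given below; the output directory and any arguments beyond those listed above are assumptions, since the full training script is not part of this card.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="paraphrase-dipper-no-ctx",  # assumed; not necessarily the original value
    learning_rate=1e-4,
    bf16=True,
    num_train_epochs=2,
    auto_find_batch_size=True,
    predict_with_generate=True,  # needed for the generation_* settings to take effect
    generation_num_beams=2,
    generation_max_length=200,
)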
Speeds, Sizes, Times
Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.
🔧 Technical Details
The model is built from google/t5-efficient-large-nl32 and finetuned on a non-context paraphrase dataset. The finetuning process uses the hyperparameters listed above to achieve the desired paraphrasing performance, and the dataset's "lexical" and "order" parameters allow for controllable paraphrasing.
📄 License
This model is released under the MIT license.
Citation
BibTeX:
@misc{krishna2023paraphrasing,
    title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense},
    author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer},
    year={2023},
    eprint={2303.13408},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model Card Contact
Contact me through Hugging Face if you have any questions.






