ChatGPT Paraphraser on T5-base
This project offers a high-quality paraphrasing model trained on diverse datasets, aiming to generate paraphrases as well as ChatGPT.
Quick Start
This model was trained on our ChatGPT paraphrase dataset. The dataset combines questions from the Quora paraphrase question dataset, texts from SQuAD 2.0, and the CNN news dataset.
Starting from the T5-base model, transfer learning was used so that the model generates paraphrases comparable to ChatGPT's. The authors describe it as one of the best paraphrasing models on Hugging Face.
Features
- High-Quality Paraphrasing: Capable of generating paraphrases similar to ChatGPT.
- Diverse Training Data: Trained on a wide range of datasets for better generalization.
- Based on T5-base: Utilizes the power of the T5-base model with transfer learning.
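Installation
No specific installation steps are provided. The usage examples below rely on the transformers library with a PyTorch backend (the code requests PyTorch tensors via return_tensors="pt").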
Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def paraphrase(
    question,
    num_beams=5,
    num_beam_groups=5,
    num_return_sequences=5,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=128
):
    # Prepend the task prefix and tokenize; inputs longer than max_length are truncated
    input_ids = tokenizer(
        f'paraphrase: {question}',
        return_tensors="pt", padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)

    # Generate num_return_sequences candidates with the given beam settings
    outputs = model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=num_return_sequences, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=num_beams, num_beam_groups=num_beam_groups,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    # Decode token ids back to text, dropping padding and end-of-sequence tokens
    res = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return res
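The function above handles one input string per call. Since the tokenizer call already pads to the longest sequence, the same approach could in principle be extended to a batch of texts. The sketch below covers only the bookkeeping such an extension would need: adding the `paraphrase: ` prefix to each text, and regrouping a flat list of decoded outputs under the assumption that they arrive as `num_return_sequences` consecutive entries per input. The helper names `add_prefix` and `group_outputs` are hypothetical and not part of the original project.

```python
from typing import List

PREFIX = "paraphrase: "  # task prefix used by the paraphrase() function above


def add_prefix(texts: List[str]) -> List[str]:
    """Prepend the task prefix to every input text."""
    return [f"{PREFIX}{text}" for text in texts]


def group_outputs(flat: List[str], num_return_sequences: int) -> List[List[str]]:
    """Split a flat list of decoded outputs into one list per input.

    Assumes the outputs are ordered input by input, with
    num_return_sequences consecutive entries for each input.
    """
    if num_return_sequences <= 0:
        raise ValueError("num_return_sequences must be positive")
    if len(flat) % num_return_sequences != 0:
        raise ValueError("output count is not a multiple of num_return_sequences")
    return [
        flat[i:i + num_return_sequences]
        for i in range(0, len(flat), num_return_sequences)
    ]
```

With two inputs and `num_return_sequences=5`, `group_outputs` would turn ten decoded strings into two lists of five.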
Advanced Usage
Here are some actual input-output examples:
Input:
text = 'What are the best places to see in New York?'
paraphrase(text)
Output:
['What are some must-see places in New York?',
'Can you suggest some must-see spots in New York?',
'Where should one go to experience the best NYC has to offer?',
'Which places should I visit in New York?',
'What are the top destinations to explore in New York?']
Input:
text = "Rammstein's album Mutter was recorded in the south of France in May and June 2000, and mixed in Stockholm in October of that year."
paraphrase(text)
Output:
['In May and June 2000, Rammstein travelled to the south of France to record his album Mutter, which was mixed in Stockholm in October of that year.',
'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year.',
'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year. It',
'Mutter, the album released by Rammstein, was recorded in southern France during May and June 2000, with mixing taking place between October and September.',
'In May and June 2000, Rammstein recorded his album Mutter in the south of France, with the mix being made at Stockholm during October.']
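In the second example, two of the returned candidates differ only by a trailing "It". One possible way to filter such near-duplicates after generation is sketched below using the standard library's `difflib`. The function name `drop_near_duplicates` and the `0.95` threshold are assumptions chosen for illustration, not something the original project provides.

```python
from difflib import SequenceMatcher
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def drop_near_duplicates(candidates: List[str], threshold: float = 0.95) -> List[str]:
    """Keep the first candidate of every group of near-identical strings.

    threshold is a hypothetical similarity cutoff in [0, 1]; tune as needed.
    """
    kept: List[str] = []
    for cand in candidates:
        norm = _normalize(cand)
        is_duplicate = any(
            SequenceMatcher(None, norm, _normalize(prev), autojunk=False).ratio() >= threshold
            for prev in kept
        )
        if not is_duplicate:
            kept.append(cand)
    return kept
```

Calling `drop_near_duplicates(paraphrase(text))` would keep the original order and drop later candidates that are almost identical to an earlier one.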
Technical Details
Train parameters
epochs = 5
batch_size = 64
max_length = 128
lr = 5e-5
batches_qty = 196465
betas = (0.9, 0.999)
eps = 1e-08
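For a rough sense of scale, the parameters above can be combined into a back-of-the-envelope count of training sequences. The source does not say whether `batches_qty` counts batches per epoch or over the whole run, nor which optimizer was used (`betas` and `eps` resemble Adam-style settings, but that is an assumption). The sketch below therefore computes both readings. The `TrainConfig` class and its method are hypothetical illustrations, not the authors' training code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Container for the training parameters listed above."""
    epochs: int = 5
    batch_size: int = 64
    max_length: int = 128
    lr: float = 5e-5
    batches_qty: int = 196465
    betas: Tuple[float, float] = (0.9, 0.999)
    eps: float = 1e-08

    def sequences_seen(self, batches_qty_is_per_epoch: bool) -> int:
        """Upper bound on sequences processed (a final batch may be smaller)."""
        multiplier = self.epochs if batches_qty_is_per_epoch else 1
        return self.batches_qty * multiplier * self.batch_size


cfg = TrainConfig()
# Reading 1 (assumption): batches_qty is the total number of batches in the run
print(cfg.sequences_seen(batches_qty_is_per_epoch=False))  # 12573760
# Reading 2 (assumption): batches_qty is the number of batches in one epoch
print(cfg.sequences_seen(batches_qty_is_per_epoch=True))   # 62868800
```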
Inference parameters
| Property | Details |
|---|---|
| num_beams | 5 |
| num_beam_groups | 5 |
| num_return_sequences | 5 |
| repetition_penalty | 10.01 |
| diversity_penalty | 3.01 |
| no_repeat_ngram_size | 2 |
| temperature | 0.7 |
| max_length | 128 |
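Several of these values constrain one another. As far as the transformers generation documentation describes, setting `num_beam_groups` above 1 selects diverse (group) beam search, where `num_beams` should be divisible by `num_beam_groups`, `num_return_sequences` should not exceed `num_beams`, and `diversity_penalty` should be positive so the groups do not collapse to identical outputs. The sketch below checks those relationships before a call to `generate`. The function `check_generation_params` is a hypothetical helper, and the rules it encodes are a summary under that assumption rather than an exhaustive validation.

```python
from typing import List


def check_generation_params(
    num_beams: int,
    num_beam_groups: int,
    num_return_sequences: int,
    diversity_penalty: float,
) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    if num_beams < 1 or num_beam_groups < 1 or num_return_sequences < 1:
        problems.append("beam and sequence counts must be positive integers")
        return problems
    if num_beams % num_beam_groups != 0:
        problems.append("num_beams must be divisible by num_beam_groups")
    if num_return_sequences > num_beams:
        problems.append("num_return_sequences must not exceed num_beams")
    if num_beam_groups > 1 and diversity_penalty <= 0.0:
        problems.append("diversity_penalty should be > 0 when num_beam_groups > 1")
    return problems


# Defaults of the paraphrase() function in the usage example
print(check_generation_params(5, 5, 5, 3.0))  # []
# Changing only num_beam_groups to 2 breaks the divisibility rule
print(check_generation_params(5, 2, 5, 3.0))
```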
BibTeX entry and citation info
@inproceedings{chatgpt_paraphraser,
  author={Vladimir Vorobev and Maxim Kuznetsov},
  title={A paraphrasing model based on ChatGPT paraphrases},
  year={2023}
}
License
The license for this project is OpenRail.