# Automatic Paraphrasing Model
This is an automatic paraphrasing model described and used in a research paper. It generates high-quality, sentence-level paraphrases.
## Quick Start
The automatic paraphrasing model is introduced in the paper "AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data" (EMNLP 2020).
## Features
- High-quality paraphrasing: Trained on a carefully selected and cleaned dataset to ensure high-quality paraphrasing output.
- Sentence-level processing: Specialized for sentence-level paraphrasing, providing accurate and diverse results.
## Usage Examples
### Basic Usage
Using `top_p=0.9` and a temperature between 0 and 1 usually results in good generated paraphrases.
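Below is a minimal sketch of how such a seq2seq model can be loaded and sampled with the Hugging Face `transformers` library. The repository id `MODEL_NAME` is a placeholder (substitute this model's actual id), and the example sentence is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "path/to/this-paraphraser"  # hypothetical id; replace with the real repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

sentence = "The restaurant serves great food at a reasonable price."
inputs = tokenizer(sentence, return_tensors="pt")

# Sampling settings recommended above: top_p=0.9 and a temperature between 0 and 1.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=60,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```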
### Advanced Usage
Higher temperatures make paraphrases more diverse and further from the input wording, but they might slightly change the meaning of the original sentence.
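Continuing the sketch above (same hypothetical `MODEL_NAME`, `tokenizer`, and `inputs`), sampling several candidates at a higher temperature produces more varied paraphrases:

```python
# Higher temperature -> more diverse outputs, at some risk of meaning drift.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=1.0,
    num_return_sequences=3,
    max_new_tokens=60,
)
for candidate in outputs:
    print(tokenizer.decode(candidate, skip_special_tokens=True))
```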
## Important Note
This is a sentence-level paraphraser. If you want to paraphrase longer inputs (like paragraphs) with this model, make sure to first break the input into individual sentences.
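A minimal sketch of this preprocessing step, reusing the hypothetical `tokenizer` and `model` from the usage examples. The regex-based splitter below is only an illustration; any proper sentence splitter (e.g. NLTK or spaCy) can be used instead.

```python
import re

def split_sentences(text: str):
    """Very simple sentence splitter; swap in any proper splitter in practice."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

paragraph = "The hotel is close to downtown. It offers free parking and breakfast."
paraphrased = []
for sentence in split_sentences(paragraph):
    inputs = tokenizer(sentence, return_tensors="pt")
    output = model.generate(**inputs, do_sample=True, top_p=0.9, temperature=0.7, max_new_tokens=60)
    paraphrased.append(tokenizer.decode(output[0], skip_special_tokens=True))

print(" ".join(paraphrased))
```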
## Documentation
### Training Data
- Dataset source: A cleaned version of the ParaBank 2 dataset introduced in "[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/)". ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus.
- Data selection: We use a subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (which corresponds to the highest paraphrasing quality), and use only one of the five paraphrases provided for each sentence.
- Data cleaning: The cleaning process removes sentences that do not look like normal English sentences, e.g., sentences that contain URLs or too many special characters (a filter of this kind is sketched after this list).
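The exact cleaning heuristics are not spelled out here, so the following is only an illustrative sketch of such a filter; the threshold and the allowed character set are assumptions.

```python
import re

URL_RE = re.compile(r"https?://|www\.")

def looks_like_normal_english(sentence: str, max_special_ratio: float = 0.1) -> bool:
    """Reject sentences containing URLs or too many special characters (illustrative thresholds)."""
    if URL_RE.search(sentence):
        return False
    allowed_punct = ".,'?!-\""
    specials = sum(1 for ch in sentence if not (ch.isalnum() or ch.isspace() or ch in allowed_punct))
    return specials / max(len(sentence), 1) <= max_special_ratio

print(looks_like_normal_english("The hotel is close to downtown."))         # True
print(looks_like_normal_english("Visit https://example.com for $$$ deals"))  # False
```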
### Training Procedure
- Checkpoint: The model is fine-tuned for 4 epochs on the above-mentioned dataset, starting from the `facebook/bart-large` checkpoint.
- Loss function: We use token-level cross-entropy loss calculated using the gold paraphrase sentence.
- Input-output design: To ensure the output of the model is grammatical, during training we use the back-translated Czech sentence as the input and the human-written English sentence as the output (see the sketch after this list).
- Mini-batch construction: Training is done with mini-batches of 1280 examples. For higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
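A minimal sketch of one training step under this design, using the Hugging Face `transformers` API. This is not the authors' training script, and the example pair is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# One (input, output) pair: back-translated sentence in, human-written English out.
source = "This hotel is situated close to the centre of the town."
target = "The hotel is located near the town center."

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(text_target=target, return_tensors="pt").input_ids

# Passing labels makes the model return the token-level cross-entropy loss
# against the gold paraphrase; an optimizer step would follow in a real training loop.
loss = model(**batch, labels=labels).loss
loss.backward()
```

In a full fine-tuning run, the length-grouped mini-batches described above can be approximated with the `group_by_length` option of the Hugging Face `Trainer`.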
### Model Information
| Property | Details |
|----------|---------|
| Model Type | Automatic Paraphrasing Model |
| Training Data | Cleaned ParaBank 2 dataset (subset of 5 million sentence pairs) |
## License
This model is licensed under the Apache-2.0 license.
## Citation
If you use this model in your work, please cite:
```bibtex
@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}
```