# Automatic Paraphrasing Model
This is an automatic paraphrasing model described and used in a research paper. It generates high-quality, sentence-level paraphrases.
## Quick Start
The automatic paraphrasing model is introduced in the paper "AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data" (EMNLP 2020).
## Features
- High-quality paraphrasing: Trained on a carefully selected and cleaned dataset to ensure high-quality paraphrasing output.
- Sentence-level processing: Specialized for sentence-level paraphrasing, providing accurate and diverse results.
## Usage Examples
### Basic Usage
Using `top_p=0.9` and a temperature between 0 and 1 usually results in good generated paraphrases.
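Below is a minimal sketch of how such a seq2seq model can be loaded and sampled with the Hugging Face `transformers` library. The repository id `MODEL_NAME` is a placeholder (substitute this model's actual id), and the example sentence is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "path/to/this-paraphraser"  # hypothetical id; replace with the real repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

sentence = "The restaurant serves great food at a reasonable price."
inputs = tokenizer(sentence, return_tensors="pt")

# Sampling settings recommended above: top_p=0.9 and a temperature between 0 and 1.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=60,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```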
### Advanced Usage
Higher temperatures make paraphrases more diverse and further from the input wording, but they might slightly change the meaning of the original sentence.
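Continuing the sketch above (same hypothetical `MODEL_NAME`, `tokenizer`, and `inputs`), sampling several candidates at a higher temperature produces more varied paraphrases:

```python
# Higher temperature -> more diverse outputs, at some risk of meaning drift.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=1.0,
    num_return_sequences=3,
    max_new_tokens=60,
)
for candidate in outputs:
    print(tokenizer.decode(candidate, skip_special_tokens=True))
```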
## Important Note
This is a sentence-level paraphraser. If you want to paraphrase longer inputs (like paragraphs) with this model, make sure to first break the input into individual sentences.
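A minimal sketch of this preprocessing step, reusing the hypothetical `tokenizer` and `model` from the usage examples. The regex-based splitter below is only an illustration; any proper sentence splitter (e.g. NLTK or spaCy) can be used instead.

```python
import re

def split_sentences(text: str):
    """Very simple sentence splitter; swap in any proper splitter in practice."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

paragraph = "The hotel is close to downtown. It offers free parking and breakfast."
paraphrased = []
for sentence in split_sentences(paragraph):
    inputs = tokenizer(sentence, return_tensors="pt")
    output = model.generate(**inputs, do_sample=True, top_p=0.9, temperature=0.7, max_new_tokens=60)
    paraphrased.append(tokenizer.decode(output[0], skip_special_tokens=True))

print(" ".join(paraphrased))
```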
## Documentation
### Training Data
- Dataset source: A cleaned version of the ParaBank 2 dataset introduced in "[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/)". ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus.
- Data selection: We use a subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (which corresponds to the highest paraphrasing quality), and use only one of the five paraphrases provided for each sentence.
- Data cleaning: The cleaning process removes sentences that do not look like normal English sentences, e.g., sentences that contain URLs or too many special characters (a filter of this kind is sketched after this list).
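The exact cleaning heuristics are not spelled out here, so the following is only an illustrative sketch of such a filter; the threshold and the allowed character set are assumptions.

```python
import re

URL_RE = re.compile(r"https?://|www\.")

def looks_like_normal_english(sentence: str, max_special_ratio: float = 0.1) -> bool:
    """Reject sentences containing URLs or too many special characters (illustrative thresholds)."""
    if URL_RE.search(sentence):
        return False
    allowed_punct = ".,'?!-\""
    specials = sum(1 for ch in sentence if not (ch.isalnum() or ch.isspace() or ch in allowed_punct))
    return specials / max(len(sentence), 1) <= max_special_ratio

print(looks_like_normal_english("The hotel is close to downtown."))         # True
print(looks_like_normal_english("Visit https://example.com for $$$ deals"))  # False
```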
### Training Procedure
- Checkpoint: The model is fine-tuned for 4 epochs on the above-mentioned dataset, starting from the `facebook/bart-large` checkpoint.
- Loss function: We use token-level cross-entropy loss calculated using the gold paraphrase sentence.
- Input-output design: To ensure the output of the model is grammatical, during training we use the back-translated Czech sentence as the input and the human-written English sentence as the output (see the sketch after this list).
- Mini-batch construction: Training is done with mini-batches of 1280 examples. For higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
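A minimal sketch of one training step under this design, using the Hugging Face `transformers` API. This is not the authors' training script, and the example pair is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# One (input, output) pair: back-translated sentence in, human-written English out.
source = "This hotel is situated close to the centre of the town."
target = "The hotel is located near the town center."

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(text_target=target, return_tensors="pt").input_ids

# Passing labels makes the model return the token-level cross-entropy loss
# against the gold paraphrase; an optimizer step would follow in a real training loop.
loss = model(**batch, labels=labels).loss
loss.backward()
```

In a full fine-tuning run, the length-grouped mini-batches described above can be approximated with the `group_by_length` option of the Hugging Face `Trainer`.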
### Model Information
| Property | Details |
|----------|---------|
| Model Type | Automatic Paraphrasing Model |
| Training Data | Cleaned ParaBank 2 dataset (subset of 5 million sentence pairs) |
## License
This model is licensed under the Apache-2.0 license.
## Citation
If you use this model in your work, please cite:
```bibtex
@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}
```