Parrot open-source paraphrase framework: T5-based generation of high-quality paraphrases to speed up NLU model training

Parrot Paraphraser On T5

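Developed by prithivida

Parrot is a T5-based paraphrase framework designed to speed up the training of natural language understanding (NLU) models by augmenting data with high-quality paraphrases.

Text generation

Transformers

#NLU training acceleration #Multi-parameter controllable paraphrasing #Conversational interface augmentation

Downloads: 910.07k

Release date: 3/2/2022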

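Model overview

Parrot is an utterance augmentation framework that expands NLU training data by generating diverse, meaning-preserving paraphrases, with adjustable adequacy, fluency, and diversity.

Model features

Three-metric optimization

Jointly optimizes paraphrases for adequacy (meaning preservation), fluency (grammatical correctness), and diversity (lexical/syntactic variation).

Tunable parameters

Supports adjusting the diversity ranker, the number of returned phrases, the length limit, and other parameters to suit different needs.

NLU-focused augmentation

Focuses on augmenting input text for conversational systems, generating short texts (maximum length 32) suited to NLU model training.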

Model capabilities

Paraphrase generation

Data augmentation for natural language understanding

Text rewriting (the fluency metric described below targets fluent English)

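Use cases

Conversational system development

Intent classification data expansion

Generate diverse training samples for intent classification tasks with limited labeled data

Improve model generalization and reduce overfitting

Slot-preserving augmentation

Generate paraphrase variants that preserve key entity slots

Expand data without breaking the annotation structure

Educational applications

Language learning material generation

Create multiple phrasings of the same question

Help learners master varied expressions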

🚀 Parrot

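Parrot is a paraphrase-based utterance augmentation framework purpose-built to speed up the training of natural language understanding (NLU) models. A paraphrase framework is more than just a paraphrasing model. For more details on the library and its usage, see the GitHub page.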

🚀 Quick start

from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

''' 
Uncomment to get reproducible paraphrase generations
def random_state(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

random_state(1234)
'''

# Initialize the model (make sure to initialize it only ONCE if you integrate this into your code)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)

phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
    print("-" * 100)
    print("Input_phrase: ", phrase)
    print("-" * 100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases:
        print(para_phrase)

Running the code above produces output like the following:

----------------------------------------------------------------------
Input_phrase: Can you recommed some upscale restaurants in Newyork?
----------------------------------------------------------------------
list some excellent restaurants to visit in new york city?
what upscale restaurants do you recommend in new york?
i want to try some upscale restaurants in new york?
recommend some upscale restaurants in newyork?
can you recommend some high end restaurants in newyork?
can you recommend some upscale restaurants in new york?
can you recommend some upscale restaurants in newyork?
----------------------------------------------------------------------
Input_phrase: What are the famous places we should not miss in Russia?
----------------------------------------------------------------------
what should we not miss when visiting russia?
recommend some of the best places to visit in russia?
list some of the best places to visit in russia?
can you list the top places to visit in russia?
show the places that we should not miss in russia?
list some famous places which we should not miss in russia?

📦 Installation

pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

💻 Usage examples

Basic usage

from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

# Initialize the model (make sure to initialize it only ONCE if you integrate this into your code)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)

phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
    print("-" * 100)
    print("Input_phrase: ", phrase)
    print("-" * 100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases:
        print(para_phrase)

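Advanced usage

# Same call as in the basic example, with the tuning knobs spelled out.
para_phrases = parrot.augment(input_phrase=phrase,
                              diversity_ranker="levenshtein",  # how diversity is ranked
                              do_diverse=False,
                              max_return_phrases=10,           # upper bound on returned paraphrases
                              max_length=32,                   # length limit
                              adequacy_threshold=0.99,         # adequacy (meaning preservation) cutoff
                              fluency_threshold=0.90)          # fluency cutoff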
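
When the same knob settings are applied to many utterances, a thin wrapper can keep the call site tidy. The following is a minimal sketch, not part of Parrot: `augment_all` and `fake_augment` are hypothetical names, and the sketch assumes that the augment call may return `None` when nothing passes the thresholds and that items may be plain strings or `(paraphrase, score)` tuples. A stand-in function replaces the real model so the example runs without downloading anything.

```python
from typing import Callable, Dict, Iterable, List


def augment_all(augment_fn: Callable[..., object],
                phrases: Iterable[str],
                **knobs) -> Dict[str, List[str]]:
    """Apply augment_fn to each phrase with the same knobs; return unique texts."""
    results: Dict[str, List[str]] = {}
    for phrase in phrases:
        # Assumption: the call may return None when nothing survives the filters.
        raw = augment_fn(input_phrase=phrase, **knobs) or []
        texts: List[str] = []
        for item in raw:
            # Assumption: items are plain strings or (paraphrase, score) tuples.
            text = item[0] if isinstance(item, tuple) else item
            if text not in texts:
                texts.append(text)
        results[phrase] = texts
    return results


def fake_augment(input_phrase, **knobs):
    """Hypothetical stand-in for parrot.augment; returns canned data."""
    if "restaurants" in input_phrase:
        return [("recommend some upscale restaurants in newyork?", 12),
                "recommend some upscale restaurants in newyork?"]
    return None


knobs = dict(diversity_ranker="levenshtein", do_diverse=False,
             max_return_phrases=10, max_length=32,
             adequacy_threshold=0.99, fluency_threshold=0.90)
out = augment_all(fake_augment,
                  ["Can you recommed some upscale restaurants in Newyork?",
                   "Please turn on the music"],
                  **knobs)
for phrase, texts in out.items():
    print(phrase, "->", texts)
```

With the real library installed, passing `parrot.augment` in place of `fake_augment` should work under the same assumptions about the return value.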

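✨ Key features

Filling the gaps in existing paraphrasing tools

Hugging Face lists 12 paraphrase models, RapidAPI lists 7 paid and commercial paraphrasing tools such as QuillBot, Rasa has discussed an experimental paraphraser for augmenting text data, Sentence-Transformers offers a paraphrase mining utility, and NLPAug offers word-level augmentation with PPDB (a database of millions of paraphrases). While these attempts at paraphrasing are all good, some gaps remain, and paraphrasing is not yet a mainstream option for text augmentation when building NLU models. Parrot aims to fill these gaps.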

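Controllable paraphrase quality

A good paraphrase has to satisfy three key metrics:

Adequacy (is the meaning preserved adequately?)
Fluency (is the paraphrase fluent English?)
Diversity (lexical/phrasal/syntactic) (how much has the paraphrase changed the original sentence?)

Parrot offers knobs to control adequacy, fluency, and diversity according to your needs.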
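
The diversity criterion can be made concrete with a small sketch. The advanced-usage call names a `levenshtein` diversity ranker; the code below is an assumption about what such a ranker might do (character-level edit distance between the input and each candidate, with a larger distance meaning a more diverse paraphrase), not Parrot's actual implementation. `levenshtein` and `rank_by_diversity` are hypothetical helper names.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rank_by_diversity(source: str, candidates: List[str]) -> List[Tuple[str, int]]:
    """Sort candidates from most to least different from the source."""
    scored = [(c, levenshtein(source.lower(), c.lower())) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


source = "Can you recommend some upscale restaurants in Newyork?"
candidates = [
    "recommend some upscale restaurants in newyork?",
    "can you recommend some upscale restaurants in newyork?",
    "list some excellent restaurants to visit in new york city?",
]
ranked = rank_by_diversity(source, candidates)
for text, score in ranked:
    print(score, text)
```

A candidate that merely lowercases the input scores 0 and lands last, which illustrates why a diversity ranking is useful on top of the adequacy and fluency filters.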

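A good augmenter

Training an NLU model requires not just a large number of utterances, but utterances annotated with intents and slots/entities. A good augmenter should work as follows:

1. Given an input utterance plus its input annotations, it outputs N paraphrased utterances that preserve the intent and slots.
2. The output paraphrases are then converted into annotated data using the input annotations from step 1.
3. The annotated data created from the output paraphrases serves as the training dataset for the NLU model.

In general, a paraphraser is a generative model and does not guarantee that slots/entities are preserved. Parrot generates high-quality paraphrases in a constrained fashion, without sacrificing intents and slots, which is what makes it a good augmenter.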
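
Steps 2 and 3 above can be illustrated with a short sketch. This is a post-hoc filter written for illustration only, not how Parrot constrains generation: it keeps a paraphrase only if every slot value from the input annotation still appears verbatim, then re-annotates it with character spans. The function name `project_annotations`, the intent label, and the slot names are all hypothetical.

```python
from typing import Dict, List


def project_annotations(intent: str,
                        slots: Dict[str, str],
                        paraphrases: List[str]) -> List[dict]:
    """Keep paraphrases containing every slot value; attach intent and spans."""
    labeled = []
    for text in paraphrases:
        lowered = text.lower()
        spans = {}
        for name, value in slots.items():
            start = lowered.find(value.lower())
            if start == -1:
                break  # slot value lost, so this paraphrase is discarded
            spans[name] = (start, start + len(value))
        else:
            # Runs only when no slot was missing.
            labeled.append({"text": text, "intent": intent, "slots": spans})
    return labeled


paraphrases = [
    "what upscale restaurants do you recommend in new york?",
    "can you recommend some high end restaurants in newyork?",
    "list some excellent restaurants to visit in new york city?",
]
training_rows = project_annotations(
    intent="restaurant_search",                      # hypothetical intent label
    slots={"price": "upscale", "city": "new york"},  # hypothetical slots
    paraphrases=paraphrases,
)
for row in training_rows:
    print(row)
```

Only the first paraphrase survives here, since the other two drop or alter a slot value. Exact string matching is deliberately strict; a real pipeline would likely need normalization or fuzzy matching.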

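🔧 Technical details

Scope

In the space of conversational engines, knowledge bots answer questions such as "When was the Berlin Wall torn down?", transactional bots carry out commands such as "Turn on the music, please", and voice assistants do both. Parrot mainly focuses on augmenting text typed into or spoken to conversational interfaces in order to build robust NLU models. (People usually do not type or speak long paragraphs to a conversational interface, so the pre-trained model is trained on text samples with a maximum length of 32.)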
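
Because of this length limit, it can help to screen utterances before sending them for augmentation. The sketch below is a rough illustration under two assumptions: that the limit of 32 counts tokens, and that whitespace-separated words are an acceptable proxy for the model's tokens (a subword tokenizer would usually count more). `rough_token_count` and `split_by_length` are hypothetical helper names.

```python
MAX_LENGTH = 32  # limit quoted in the text; the unit is assumed to be tokens


def rough_token_count(utterance: str) -> int:
    """Whitespace word count, used only as a rough proxy for model tokens."""
    return len(utterance.split())


def split_by_length(utterances, max_length=MAX_LENGTH):
    """Return (short, long) lists; only the short ones fit the length budget."""
    short, long_ = [], []
    for utterance in utterances:
        target = short if rough_token_count(utterance) <= max_length else long_
        target.append(utterance)
    return short, long_


utterances = [
    "When was the Berlin Wall torn down?",
    "Turn on the music, please",
    " ".join(["word"] * 40),  # artificially long input
]
short, long_ = split_by_length(utterances)
print(len(short), "short,", len(long_), "too long")
```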