flan-t5-large-grammar-synthesis open-source text model - free grammar correction that preserves meaning


Flan T5 Large Grammar Synthesis

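Developed by pszemraj

A text-to-text model fine-tuned from google/flan-t5-large for grammar correction. It can handle text containing many errors without changing the meaning of text that is already grammatically correct.

Large Language Model #Grammar Correction #Text Error Correction #Language Model Optimization

Downloads: 25.07k

Release date: 11/26/2022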

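Model overview

The model is intended for single-shot grammar correction. It is particularly suited to text that may contain many grammatical errors, while leaving the original information in grammatically correct text unchanged.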

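Model features

Single-shot grammar correction

Corrects several kinds of grammatical error in one pass, including spelling, punctuation, and structural problems.

Meaning preservation

Fixes grammatical errors without changing the meaning of the original text.

Batch processing

Supports batch processing of multiple sentences or short paragraphs, which improves throughput.

ONNX support

An ONNX-format checkpoint is provided for more efficient inference with the optimum library.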

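Model capabilities

Grammatical error correction

Spelling correction

Punctuation correction

Sentence structure improvement

Text normalization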

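Use cases

Text processing

Audio transcription correction

Corrects grammatical errors in transcripts produced by automatic speech recognition (ASR) systems.

Improves the readability and accuracy of transcripts

Chatbot response cleanup

Fixes grammatical errors in chatbot-generated text to improve conversation quality.

Makes the conversation read more naturally

OCR post-processing

Corrects text errors in the output of optical character recognition (OCR) systems.

Improves the accuracy of OCR output

Education

Writing assistance

Helps students and non-native speakers identify and fix grammatical errors in their writing.

Improves writing quality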

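🚀 Grammar Synthesis Large: FLAN-T5

This project is a fine-tuned version of google/flan-t5-large for grammar correction, trained on an expanded version of the JFLEG dataset. A demo is available on HF Spaces.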

🚀 Quick start

Install dependencies

Before running the code, install the transformers library:

pip install transformers

Code example

Run the following code to correct grammar:

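from transformers import pipeline

# Load the fine-tuned grammar correction checkpoint from the Hugging Face Hub
corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)  # a list holding one dict with a 'generated_text' key
print(results)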

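Batch inference

For details on batch inference, see this discussion thread. In short, the training examples usually contain several sentences at once, so inference should be run the same way: pass the model roughly 64-96 tokens at a time (or, alternatively, 2-3 sentences split with a regular expression).

⚠️ Important note

Before running the text2text model, it helps to check whether a given sentence needs grammar correction at all. A BERT-type model fine-tuned on CoLA, such as textattack/roberta-base-CoLA, can do this.

You can check the notebook here for a demonstration of batch inference.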
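The "2-3 sentences split with a regular expression" advice can be sketched as below. This is a minimal illustration, not the code from the linked notebook: the helper names `chunk_sentences` and `correct_long_text` are hypothetical, and the sentence boundary regex is a simple assumption (split after `.`, `!` or `?` followed by whitespace). Unpunctuated ASR output would come back as a single chunk and would need token-based chunking instead.

```python
import re
from typing import Callable, List

# Split after ., ! or ? when followed by whitespace. This is a deliberately
# simple heuristic (an assumption), not the splitter used by the model author.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Group consecutive sentences into chunks of `sentences_per_chunk`."""
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def correct_long_text(
    text: str,
    corrector: Callable[[List[str]], list],
    sentences_per_chunk: int = 3,
) -> str:
    """Correct `text` chunk by chunk and join the corrected chunks."""
    chunks = chunk_sentences(text, sentences_per_chunk)
    if not chunks:
        return ""
    # A text2text-generation pipeline called with a list of strings returns
    # one {'generated_text': ...} dict per input string.
    outputs = corrector(chunks)
    return " ".join(out["generated_text"].strip() for out in outputs)
```

`corrector` can be the `text2text-generation` pipeline from the quick start; any callable with the same list-in, list-of-dicts-out shape also works, which keeps the chunking logic easy to test without downloading the model.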

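✨ Key features

Grammar correction

The goal is a text2text language model that performs "single-shot grammar correction" on text that may contain many grammatical errors, without making semantic changes to text or information that is already grammatically correct. Compare some of the heavier-error examples against other grammar correction models to see the difference.

ONNX support

This model has been converted to ONNX format and can be loaded and used with Hugging Face's optimum library.

Install `optimum`

pip install optimum[onnxruntime]
# If you want to use a different runtime, read its documentation

Load the model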

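from optimum.pipelines import pipeline

# Hub ID of this model, defined here so the snippet runs on its own
corrector_model_name = "pszemraj/flan-t5-large-grammar-synthesis"
corrector = pipeline(
    "text2text-generation", model=corrector_model_name, accelerator="ort"
)
# Use as normal, e.g. corrector("i can has cheezburger")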

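Other checkpoints

If you are willing to trade some grammar correction quality for faster inference, consider the base and small checkpoints, which are fine-tuned from the corresponding t5 checkpoints.

💻 Usage examples

Basic usage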

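from transformers import pipeline

# Load the fine-tuned grammar correction checkpoint from the Hugging Face Hub
corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)  # a list holding one dict with a 'generated_text' key
print(results)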

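Advanced usage

# Batch inference example; see the discussion thread and the demo notebook
# For batch inference, pass roughly 64-96 tokens at a time (or 2-3 sentences split with a regular expression)
# First use a BERT-type model to check whether a sentence needs grammar correction
from transformers import pipeline

# Grammaticality checker (CoLA acceptability classifier)
checker = pipeline('text-classification', model='textattack/roberta-base-CoLA')
corrector = pipeline('text2text-generation', model='pszemraj/flan-t5-large-grammar-synthesis')

sentences = ["i can has cheezburger", "There car broke down so their hitching a ride to they're class."]
for sentence in sentences:
    # LABEL_0 is treated here as "not acceptable", i.e. the sentence needs correction
    need_correction = checker(sentence)[0]['label'] == 'LABEL_0'
    if need_correction:
        result = corrector(sentence)
        print(result)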
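The loop above prints only the sentences that were corrected. When the output has to line up one-to-one with the input, a small wrapper that passes acceptable sentences through unchanged may be more convenient. The sketch below is an illustration under assumptions: `correct_if_needed` is a hypothetical helper name, and it takes `LABEL_0` to mean "needs correction", exactly as the loop above does.

```python
from typing import Callable, List


def correct_if_needed(
    sentences: List[str],
    checker: Callable[[str], list],
    corrector: Callable[[str], list],
    unacceptable_label: str = "LABEL_0",
) -> List[str]:
    """Return one output per input, correcting only the flagged sentences."""
    outputs = []
    for sentence in sentences:
        # The classifier returns a list like [{'label': ..., 'score': ...}]
        label = checker(sentence)[0]["label"]
        if label == unacceptable_label:
            # Flagged as ungrammatical: run the text2text corrector
            outputs.append(corrector(sentence)[0]["generated_text"])
        else:
            # Judged acceptable: keep the original text untouched
            outputs.append(sentence)
    return outputs
```

Called as `correct_if_needed(sentences, checker, corrector)` with the two pipelines defined above, it returns a list the same length and order as `sentences`. Skipping the corrector for acceptable sentences also avoids spending beam search on text that does not need it.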

📚 Detailed documentation

Model description

The goal of this model is a text2text language model that performs "single-shot grammar correction" on text that may contain many grammatical errors, without making semantic changes to text or information that is already grammatically correct.

Dataset and examples

Dataset: an expanded version of the JFLEG dataset.
Examples: the sample texts below are listed with their titles:
- Compound sentence example 1: "There car broke down so their hitching a ride to they're class."
- Cheeseburger example: "i can has cheezburger"
- Transcribed audio example 2: "so em if we have an now so with fito ringina know how to estimate the tren given the ereafte mylite trend we can also em an estimate is nod s i again tort watfettering an we have estimated the trend an called wot to be called sthat of exty right now we can and look at wy this should not hare a trend i becan we just remove the trend an and we can we now estimate tesees ona effect of them exty"
- Incorrect word choice (context) example: "My coworker said he used a financial planner to help choose his stocks so he wouldn't loose money."
- Lowercased audio transcription output example: "good so hve on an tadley i'm not able to make it to the exla session on monday this week e which is why i am e recording pre recording an this excelleision and so to day i want e to talk about two things and first of all em i wont em wene give a summary er about ta ohow to remove trents in these nalitives from time series"
- Dangling modifier example: "Frustrated, the chairs took me forever to set up."
- Misspelling example: "I would like a peice of pie."
- Chatbot on Zurich example: "Which part of Zurich was you going to go hiking in when we were there for the first time together? ! ?"
- Social science ASR summary output example: "Most of the course is about semantic or content of language but there are also interesting topics to be learned from the servicefeatures except statistics in characters in documents. At this point, Elvthos introduces himself as his native English speaker and goes on to say that if you continue to work on social scnce,"
- Medical course audio transcription example: "they are somewhat nearby right yes please i'm not sure how the innish is tepen thut mayyouselect one that istatte lo variants in their property e ere interested and anyone basical e may be applyind reaching the browing approach were"

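Use cases

Correcting high-error-rate language model output: for example audio transcription (ASR) or handwriting OCR output. Depending on the model or system used, it may be worth applying this model to text after OCR.
Correcting the output of text generation models: making generated text more coherent and removing obvious errors that break the immersion of a conversation. For example, applying it to the output of this OPT 2.7B chatbot model.
Fixing so-called "tortured phrases": these are dead giveaways that text was generated by a language model. Some phrases may not be fixable, though, especially those involving domain-specific terminology.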

🔧 Technical details

Parameter settings

Property	Details
Max length	128
Min length	4
Number of beams	8
Repetition penalty	1.21
Length penalty	1
Early stopping	Yes
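The settings in the table can be passed to the pipeline at call time. The sketch below assumes each row maps onto the standard `generate()` argument of the matching name in transformers (`max_length`, `min_length`, `num_beams`, `repetition_penalty`, `length_penalty`, `early_stopping`); that mapping is a reading of the table, and the helper names `GENERATION_KWARGS` and `correct` are hypothetical.

```python
from typing import Any, Callable, Dict

# Generation settings copied from the parameter table above. The keyword
# names are the usual transformers generate() arguments (an assumed mapping).
GENERATION_KWARGS: Dict[str, Any] = {
    "max_length": 128,
    "min_length": 4,
    "num_beams": 8,
    "repetition_penalty": 1.21,
    "length_penalty": 1.0,
    "early_stopping": True,
}


def correct(text: str, corrector: Callable[..., list]) -> str:
    """Run `corrector` on `text` with the table's settings and return the text."""
    # Extra keyword arguments given to a text2text-generation pipeline call
    # are forwarded to the underlying generate() call.
    outputs = corrector(text, **GENERATION_KWARGS)
    return outputs[0]["generated_text"]
```

With the pipeline from the quick start, `correct('i can has cheezburger', corrector)` returns the corrected string directly. Keeping the settings in one dictionary makes it easy to try, say, fewer beams when latency matters more than quality.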

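Limitations

Dataset license: cc-by-nc-sa-4.0
Model license: apache-2.0
The model is still under development. It may be useful for "single-shot grammar correction" in many cases, but please check the outputs for correctness.

📄 License

The dataset used in this project is licensed under cc-by-nc-sa-4.0, and the model is licensed under apache-2.0.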

📚 Citation

If you find this fine-tuned model useful in your work, please consider citing it:

@misc {peter_szemraj_2022,
	author       = { {Peter Szemraj} },
	title        = { flan-t5-large-grammar-synthesis (Revision d0b5ae2) },
	year         = 2022,
	url          = { https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis },
	doi          = { 10.57967/hf/0138 },
	publisher    = { Hugging Face }
}