🚀 Grammar Synthesis: FLAN-T5

This model is a fine-tuned version of google/flan-t5-large for grammar correction on an expanded version of the JFLEG dataset. You can try the demo on Hugging Face Spaces.
🚀 Quick Start

Install dependencies

Before running the code, install the transformers library:

```bash
pip install transformers
```
Code example

Run the following code to perform grammar correction:

```python
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
Batch inference

For details on batch inference, see this discussion thread. In short, the dataset contains multiple sentences per example, so it is recommended to run inference the same way: in batches of roughly 64-96 tokens (or, alternatively, split the text into chunks of 2-3 sentences with a regex).
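The chunking strategy described above can be sketched in plain Python. The regex and helper name below are illustrative, not part of the model's API; the sentence-boundary pattern is a deliberately naive assumption:

```python
import re

def split_into_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Split text on sentence-ending punctuation and group into small chunks."""
    # Naive sentence boundary: ., !, or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [
        ' '.join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

chunks = split_into_chunks(
    "i can has cheezburger. There car broke down. I would like a peice of pie.",
    sentences_per_chunk=2,
)
# Each chunk can then be passed to the corrector pipeline in one batched call,
# e.g. corrector(chunks, batch_size=len(chunks))
```

For real-world text a proper sentence splitter (e.g. from nltk or spaCy) would be more robust than this regex.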
⚠️ Important Note

Before running a sentence through the text2text model, it can be helpful to first check whether it needs grammar correction at all. You can do this with a BERT-type model fine-tuned on CoLA, such as textattack/roberta-base-CoLA.

See the notebook here for a demo of batch inference.
✨ Main Features

Grammar correction

The goal is a text2text language model that successfully performs "single-shot grammar correction" on potentially highly error-prone text, without making semantic changes to text/information that is already grammatically correct. Compare some of the high-error examples with other grammar correction models to see the difference.
ONNX support

This model has been converted to ONNX format and can be loaded and used with Hugging Face's optimum library.

Install optimum:

```bash
pip install optimum[onnxruntime]
```

Load the model:

```python
from optimum.pipelines import pipeline

corrector_model_name = "pszemraj/flan-t5-large-grammar-synthesis"
corrector = pipeline(
    "text2text-generation", model=corrector_model_name, accelerator="ort"
)
```
Other checkpoints

If you are willing to trade some grammar correction quality for faster inference, consider the base and small checkpoints, fine-tuned from the corresponding t5 checkpoints.
💻 Usage Examples

Basic usage

```python
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
Advanced usage

```python
from transformers import pipeline

# Classifier fine-tuned on CoLA: LABEL_0 = ungrammatical, LABEL_1 = grammatical
checker = pipeline('text-classification', model='textattack/roberta-base-CoLA')
corrector = pipeline('text2text-generation', model='pszemraj/flan-t5-large-grammar-synthesis')

sentences = ["i can has cheezburger", "There car broke down so their hitching a ride to they're class."]
for sentence in sentences:
    # Only run the (more expensive) corrector when the checker flags the sentence
    needs_correction = checker(sentence)[0]['label'] == 'LABEL_0'
    if needs_correction:
        result = corrector(sentence)
        print(result)
```
📚 Documentation

Model description

The goal of this model is a text2text language model that successfully performs "single-shot grammar correction" on potentially highly error-prone text, without making semantic changes to text/information that is already grammatically correct.
Dataset and examples

- Dataset: an expanded version of the JFLEG dataset.
- Examples: some sample texts and their titles:
  - Compound sentence example 1: "There car broke down so their hitching a ride to they're class."
  - Cheeseburger example: "i can has cheezburger"
  - Transcribed audio example 2: "so em if we have an now so with fito ringina know how to estimate the tren given the ereafte mylite trend we can also em an estimate is nod s i again tort watfettering an we have estimated the trend an called wot to be called sthat of exty right now we can and look at wy this should not hare a trend i becan we just remove the trend an and we can we now estimate tesees ona effect of them exty"
  - Incorrect word choice (context) example: "My coworker said he used a financial planner to help choose his stocks so he wouldn't loose money."
  - Lowercased audio transcription output example: "good so hve on an tadley i'm not able to make it to the exla session on monday this week e which is why i am e recording pre recording an this excelleision and so to day i want e to talk about two things and first of all em i wont em wene give a summary er about ta ohow to remove trents in these nalitives from time series"
  - Dangling modifier example: "Frustrated, the chairs took me forever to set up."
  - Spelling error example: "I would like a peice of pie."
  - Chatbot-about-Zurich example: "Which part of Zurich was you going to go hiking in when we were there for the first time together? ! ?"
  - Social science ASR summary output example: "Most of the course is about semantic or content of language but there are also interesting topics to be learned from the servicefeatures except statistics in characters in documents. At this point, Elvthos introduces himself as his native English speaker and goes on to say that if you continue to work on social scnce,"
  - Medical course audio transcription example: "they are somewhat nearby right yes please i'm not sure how the innish is tepen thut mayyouselect one that istatte lo variants in their property e ere interested and anyone basical e may be applyind reaching the browing approach were"
Use cases

- Correcting high-error-rate language model output: e.g. audio transcription (ASR) or handwriting OCR output. Depending on the model/system used, applying this model to OCR-processed text may be worthwhile.
- Correcting the output of text generation models: making generated text more coherent and removing obvious errors that would break conversational immersion. For example, applying the model to the output of this OPT 2.7B chatbot model.
- Fixing so-called "tortured phrases": phrases that are telltale signs of language-model-generated text. Some of them, however, may not be fixable, especially those involving domain-specific terminology.
🔧 Technical Details

Parameter settings

| Parameter | Value |
| --- | --- |
| max_length | 128 |
| min_length | 4 |
| num_beams | 8 |
| repetition_penalty | 1.21 |
| length_penalty | 1 |
| early_stopping | True |
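The settings in the table above map onto the standard transformers generation arguments. A minimal sketch of how they could be passed at call time (the dict name is illustrative):

```python
# Generation settings from the table above, expressed as standard
# transformers generation kwargs.
generation_kwargs = {
    "max_length": 128,
    "min_length": 4,
    "num_beams": 8,
    "repetition_penalty": 1.21,
    "length_penalty": 1.0,
    "early_stopping": True,
}

# These can be passed directly when calling the pipeline, e.g.:
# results = corrector(raw_text, **generation_kwargs)
```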
Limitations

- Dataset license: cc-by-nc-sa-4.0
- Model license: apache-2.0
- The model is still under development; while it is likely useful for "single-shot grammar correction" in many cases, check the outputs for correctness.
📄 License

The dataset used in this project is released under the cc-by-nc-sa-4.0 license; the model is released under the apache-2.0 license.
📚 Citation

If you find this fine-tuned model useful in your work, please consider citing it:

```bibtex
@misc{peter_szemraj_2022,
  author    = {{Peter Szemraj}},
  title     = {flan-t5-large-grammar-synthesis (Revision d0b5ae2)},
  year      = 2022,
  url       = {https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis},
  doi       = {10.57967/hf/0138},
  publisher = {Hugging Face}
}
```