🚀 Grammar Synthesis: FLAN-T5

This model is a fine-tuned version of google/flan-t5-large for grammar correction on an expanded version of the JFLEG dataset. You can try the demo on Hugging Face Spaces.
🚀 Quick Start

Install dependencies

Before running the code, install the transformers library:

```bash
pip install transformers
```
Code example

Run the following code to perform grammar correction:

```python
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
Batch inference

For details on batch inference, see this discussion thread. In short, the dataset contains multiple sentences per example, so it is recommended to run inference the same way: in batches of roughly 64-96 tokens (or, alternatively, split the text into chunks of 2-3 sentences with a regex).
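The chunking strategy described above can be sketched in plain Python. The regex and helper name below are illustrative, not part of the model's API; the sentence-boundary pattern is a deliberately naive assumption:

```python
import re

def split_into_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Split text on sentence-ending punctuation and group into small chunks."""
    # Naive sentence boundary: ., !, or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [
        ' '.join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

chunks = split_into_chunks(
    "i can has cheezburger. There car broke down. I would like a peice of pie.",
    sentences_per_chunk=2,
)
# Each chunk can then be passed to the corrector pipeline in one batched call,
# e.g. corrector(chunks, batch_size=len(chunks))
```

For real-world text a proper sentence splitter (e.g. from nltk or spaCy) would be more robust than this regex.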
⚠️ Important Note

Before running a sentence through the text2text model, it can be helpful to first check whether it needs grammar correction at all. You can do this with a BERT-type model fine-tuned on CoLA, such as textattack/roberta-base-CoLA.

See the notebook here for a demo of batch inference.
✨ Main Features

Grammar correction

The goal is a text2text language model that successfully performs "single-shot grammar correction" on potentially highly error-prone text, without making semantic changes to text/information that is already grammatically correct. Compare some of the high-error examples with other grammar correction models to see the difference.
ONNX support

This model has been converted to ONNX format and can be loaded and used with Hugging Face's optimum library.

Install optimum:

```bash
pip install optimum[onnxruntime]
```

Load the model:

```python
from optimum.pipelines import pipeline

corrector_model_name = "pszemraj/flan-t5-large-grammar-synthesis"
corrector = pipeline(
    "text2text-generation", model=corrector_model_name, accelerator="ort"
)
```
Other checkpoints

If you are willing to trade some grammar correction quality for faster inference, consider the base and small checkpoints, fine-tuned from the corresponding t5 checkpoints.
💻 Usage Examples

Basic usage

```python
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
```
Advanced usage

```python
from transformers import pipeline

# Classifier fine-tuned on CoLA: LABEL_0 = ungrammatical, LABEL_1 = grammatical
checker = pipeline('text-classification', model='textattack/roberta-base-CoLA')
corrector = pipeline('text2text-generation', model='pszemraj/flan-t5-large-grammar-synthesis')

sentences = ["i can has cheezburger", "There car broke down so their hitching a ride to they're class."]
for sentence in sentences:
    # Only run the (more expensive) corrector when the checker flags the sentence
    needs_correction = checker(sentence)[0]['label'] == 'LABEL_0'
    if needs_correction:
        result = corrector(sentence)
        print(result)
```
📚 Documentation

Model description

The goal of this model is a text2text language model that successfully performs "single-shot grammar correction" on potentially highly error-prone text, without making semantic changes to text/information that is already grammatically correct.
Dataset and examples

- Dataset: an expanded version of the JFLEG dataset.
- Examples: some sample texts and their titles:
  - Compound sentence example 1: "There car broke down so their hitching a ride to they're class."
  - Cheeseburger example: "i can has cheezburger"
  - Transcribed audio example 2: "so em if we have an now so with fito ringina know how to estimate the tren given the ereafte mylite trend we can also em an estimate is nod s i again tort watfettering an we have estimated the trend an called wot to be called sthat of exty right now we can and look at wy this should not hare a trend i becan we just remove the trend an and we can we now estimate tesees ona effect of them exty"
  - Incorrect word choice (context) example: "My coworker said he used a financial planner to help choose his stocks so he wouldn't loose money."
  - Lowercased audio transcription output example: "good so hve on an tadley i'm not able to make it to the exla session on monday this week e which is why i am e recording pre recording an this excelleision and so to day i want e to talk about two things and first of all em i wont em wene give a summary er about ta ohow to remove trents in these nalitives from time series"
  - Dangling modifier example: "Frustrated, the chairs took me forever to set up."
  - Spelling error example: "I would like a peice of pie."
  - Chatbot-about-Zurich example: "Which part of Zurich was you going to go hiking in when we were there for the first time together? ! ?"
  - Social science ASR summary output example: "Most of the course is about semantic or content of language but there are also interesting topics to be learned from the servicefeatures except statistics in characters in documents. At this point, Elvthos introduces himself as his native English speaker and goes on to say that if you continue to work on social scnce,"
  - Medical course audio transcription example: "they are somewhat nearby right yes please i'm not sure how the innish is tepen thut mayyouselect one that istatte lo variants in their property e ere interested and anyone basical e may be applyind reaching the browing approach were"
Use cases

- Correcting high-error-rate language model output: e.g. audio transcription (ASR) or handwriting OCR output. Depending on the model/system used, applying this model to OCR-processed text may be worthwhile.
- Correcting the output of text generation models: making generated text more coherent and removing obvious errors that would break conversational immersion. For example, applying the model to the output of this OPT 2.7B chatbot model.
- Fixing so-called "tortured phrases": phrases that are telltale signs of language-model-generated text. Some of them, however, may not be fixable, especially those involving domain-specific terminology.
🔧 Technical Details

Parameter settings

| Parameter | Value |
| --- | --- |
| max_length | 128 |
| min_length | 4 |
| num_beams | 8 |
| repetition_penalty | 1.21 |
| length_penalty | 1 |
| early_stopping | True |
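The settings in the table above map onto the standard transformers generation arguments. A minimal sketch of how they could be passed at call time (the dict name is illustrative):

```python
# Generation settings from the table above, expressed as standard
# transformers generation kwargs.
generation_kwargs = {
    "max_length": 128,
    "min_length": 4,
    "num_beams": 8,
    "repetition_penalty": 1.21,
    "length_penalty": 1.0,
    "early_stopping": True,
}

# These can be passed directly when calling the pipeline, e.g.:
# results = corrector(raw_text, **generation_kwargs)
```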
Limitations

- Dataset license: cc-by-nc-sa-4.0
- Model license: apache-2.0
- The model is still under development; while it is likely useful for "single-shot grammar correction" in many cases, check the outputs for correctness.
📄 License

The dataset used in this project is released under the cc-by-nc-sa-4.0 license; the model is released under the apache-2.0 license.
📚 Citation

If you find this fine-tuned model useful in your work, please consider citing it:

```bibtex
@misc{peter_szemraj_2022,
  author    = {{Peter Szemraj}},
  title     = {flan-t5-large-grammar-synthesis (Revision d0b5ae2)},
  year      = 2022,
  url       = {https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis},
  doi       = {10.57967/hf/0138},
  publisher = {Hugging Face}
}
```