punctuation_fullstop_truecase_english: open-source English text-processing model


Punctuation Fullstop Truecase English

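Developed by 1-800-BAD-CODE

This model is designed for English text and performs punctuation restoration, true-casing, and sentence boundary detection in a single pass.

Text generation · English · License: Apache-2.0 · #PunctuationRestoration #TrueCasing #MultiTask

Downloads: 427

Released: 3/11/2023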

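Model overview

Takes lowercase, unpunctuated English text and, in one pass, restores punctuation, applies true-casing, and segments the text into sentences. It also handles punctuated acronyms and words with arbitrary capitalization.

Model features

Integrated multi-task processing

Performs punctuation restoration, true-casing, and sentence boundary detection together

Acronym handling

Predicts punctuated acronyms (e.g., U.S.) through a dedicated class

Flexible capitalization

Multi-label prediction handles arbitrarily cased words such as NATO and McDonald's

Long-text processing

Inputs longer than 256 subwords are segmented automatically

Model capabilities

Punctuation restoration

True-casing

Sentence boundary detection

Acronym recognition

Informal text processing

Use cases

Text normalization

News copy processing

Converts unpunctuated news drafts into a normalized form

Punctuation restoration F1 97.21, true-casing F1 99.50 (micro averages)

Conversational text cleanup

Processes informal text such as chat logs (the model was trained on news data, so quality may be lower here)

Supports common abbreviations and colloquial expressions

Data preprocessing

NLP pipeline preprocessing

Prepares normalized text for downstream tasks

Sentence boundary detection F1 99.09 (micro average)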

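🚀 Punctuation and True-Casing Model

This model accepts lowercase, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in one pass.

🚀 Quick start

The model takes lowercase, unpunctuated English text as input and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in one pass. In contrast to many similar models, it can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, and it can handle arbitrarily capitalized words (e.g., "NATO", "McDonald's") via multi-label true-casing predictions.

⚠️ Important note

The text-generation widget does not appear to respect line breaks. Instead, the pipeline inserts a newline token \n at the sentence boundaries predicted by the model.

✨ Key features

All-in-one processing: punctuation restoration, true-casing, and sentence boundary detection in a single pass.
Special classes: predicts punctuated acronyms and arbitrarily capitalized words.
Long-text processing: with the companion package, inputs of arbitrary length can be processed.

📦 Installation

The easy way to use this model is to install punctuators: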

pip install punctuators

If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!).

💻 Usage examples

Basic usage

from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    # Literally my weekend
    "i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
    "despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
    "when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars",
    # Real acronyms in sentences that I made up
    "george hw bush was the president of the us for 8 years",
    "i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
    # See how the model performs on made-up acronyms
    "i went to the fgw store and bought a new tg optical scope",
    # First few sentences from today's featured article summary on Wikipedia
    "it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality"
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print(f"Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

The exact output may vary with the model version. The current output is:

Expected output

In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
	Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
	Out: We like to take morning adventures on the weekends.

In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
	Out: Despite being mid March, it snowed overnight and into the morning.
	Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.

In: when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars
	Out: When I got home, I trained this model on the Lambda Cloud.
	Out: On an A100 GPU with about 10 million lines of text, the total budget was less than 5 dollars.

In: george hw bush was the president of the us for 8 years
	Out: George H.W. Bush was the president of the U.S. for 8 years.

In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
	Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
	Out: I suggested he get one of those new battery operated ones.
	Out: They're so much quieter.

In: i went to the fgw store and bought a new tg optical scope
	Out: I went to the FGW store and bought a new TG optical scope.

In: it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality
	Out: It's that man again.
	Out: ITMA was a radio comedy programme that was broadcast by the BBC for Twelve Series from 1939 to 1949, featuring Tommy Handley.
	Out: In the central role, ITMA was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations.
	Out: Parts of the scripts were rewritten in the hours before the broadcast to ensure topicality.

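📚 Detailed documentation

Model details

This model implements the following graph; each step is briefly described below.

Encoding: The model first tokenizes the text with a subword tokenizer, a SentencePiece model with a vocabulary size of 32k. The input sequence is then encoded by a base-sized Transformer consisting of 6 layers with a model dimension of 512.
Punctuation prediction: The encoded sequence is fed into a feed-forward classification network that predicts punctuation tokens. Punctuation is predicted once per subword, which allows acronyms to be punctuated correctly. An indirect benefit of per-subword prediction is that the model can run in a graph generalized to continuous-script languages such as Chinese.
Sentence boundary detection: The sentence boundary detection (SBD) head is conditioned on punctuation via embeddings. Each punctuation prediction selects an embedding for that token, which is concatenated to the encoded representation. The SBD head therefore sees both the encoding of the unpunctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.
Shift and concatenate sentence boundaries: In English, the first character of each sentence should be upper-cased, so the sentence boundary information must reach the true-case classification network. Because that network is feed-forward and has no temporal context, each time step must itself encode whether it is the first word of a sentence. The binary sentence boundary decisions are therefore shifted right by one: if token N-1 is a sentence boundary, token N is the first word of a sentence. Concatenating this with the encoded text gives every time step the SBD head's prediction of whether it begins a sentence.
True-case prediction: With punctuation and sentence boundary information available, a classification network predicts the true case. Because true-casing must be done per character, the network makes N predictions per token, where N is the length of the subtoken. (In practice, N is the length of the longest possible subword, and the extra predictions are ignored.) This scheme captures acronyms such as "NATO" as well as bi-capitalized words such as "MacDonald".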
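
The boundary shift and the per-character case predictions described above can be illustrated with a minimal pure-Python sketch. It is not the model's implementation: the function names are hypothetical, the predictions are plain Python lists rather than tensors, and treating the first token as a sentence start is an assumption made for this example.

```python
from typing import List


def shift_boundaries(is_boundary: List[bool]) -> List[bool]:
    """Shift boundary flags right by one position.

    Token N is flagged as a sentence start if token N-1 was a boundary.
    Assumption for this sketch: the very first token starts a sentence.
    """
    if not is_boundary:
        return []
    return [True] + is_boundary[:-1]


def apply_case(token: str, upper_mask: List[bool]) -> str:
    """Upper-case the characters whose mask entry is True.

    The mask may be longer than the token (one entry per character of the
    longest possible subword); entries past the token's end are ignored.
    """
    return "".join(
        ch.upper() if i < len(upper_mask) and upper_mask[i] else ch
        for i, ch in enumerate(token)
    )


# Token 1 ends a sentence, so token 2 is the first word of the next one.
starts = shift_boundaries([False, True, False])   # [True, False, True]

# An all-True mask yields an acronym; a sparse mask yields bi-capitalization.
acronym = apply_case("nato", [True] * 8)                                   # "NATO"
bicap = apply_case("macdonald", [True, False, False, True] + [False] * 5)  # "MacDonald"
```

Predicting a fixed-length mask per token keeps the classifier's output shape constant, at the cost of some unused predictions for short subwords.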

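The model's maximum length is 256 subtokens, a limit imposed by the trained embeddings. However, the punctuators package described above transparently predicts on overlapping subsegments of long inputs and fuses the results before returning the output, so inputs can be arbitrarily long.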
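
One way to picture overlapping subsegments is the sketch below. It is an illustration only and makes several assumptions: the actual strategy inside punctuators may differ, the function names and the overlap of 16 are hypothetical, and the fusion rule (each overlap is split in half between neighbouring windows) is just one plausible choice.

```python
from typing import List, Sequence, Tuple


def split_windows(n_tokens: int, max_len: int = 256, overlap: int = 16) -> List[Tuple[int, int]]:
    """Return (start, end) spans that cover range(n_tokens) with overlapping windows."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += max_len - overlap
    return spans


def merge_predictions(window_preds: Sequence[Sequence], overlap: int = 16) -> list:
    """Fuse per-window predictions into one sequence.

    In each overlap, the left window keeps the first half and the right
    window keeps the rest, so tokens near a window edge are taken from the
    window in which they have more context.
    """
    half = overlap // 2
    last = len(window_preds) - 1
    merged = []
    for i, preds in enumerate(window_preds):
        lo = 0 if i == 0 else half
        hi = len(preds) if i == last else len(preds) - (overlap - half)
        merged.extend(preds[lo:hi])
    return merged


# Toy check with tiny windows: each "prediction" is just the token index.
spans = split_windows(14, max_len=8, overlap=4)
preds = [list(range(start, end)) for start, end in spans]
fused = merge_predictions(preds, overlap=4)
```

With these toy settings the spans are (0, 8), (4, 12), and (8, 14), and the fused result is the indices 0 through 13 in order, each taken exactly once.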

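Punctuation tokens

The model predicts the following set of punctuation tokens:

Token	Description
NULL	Predict no punctuation
ACRONYM	Every character in this subword ends with a period
.	Latin full stop
,	Latin comma
?	Latin question mark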
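
As a rough illustration of how these labels could be applied to a subword, consider the sketch below. The function name is hypothetical, and the subword is assumed to be plain text without any SentencePiece word-boundary marker; the model's real post-processing is not shown in this document.

```python
def apply_punctuation(subword: str, label: str) -> str:
    """Apply one predicted punctuation label to one subword."""
    if label == "NULL":
        return subword
    if label == "ACRONYM":
        # A period follows every character of the subword: "us" -> "u.s."
        return "".join(ch + "." for ch in subword)
    if label in {".", ",", "?"}:
        return subword + label
    raise ValueError(f"unknown punctuation label: {label!r}")


punctuated = [
    apply_punctuation("us", "ACRONYM"),   # "u.s."
    apply_punctuation("years", "."),      # "years."
    apply_punctuation("hike", "NULL"),    # "hike"
]
```

Because the ACRONYM class acts on a whole subword, a single prediction is enough to punctuate "us" or "am" without predicting a separate period for each letter.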

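Training details

Training framework

This model was trained on a forked branch of the NeMo framework.

Training data

This model was trained on news-crawl data from WMT. Approximately 10 million lines were used, drawn from the years 2021 and 2012. The 2012 data was included to reduce bias: annual news is typically dominated by a few topics, and 2021 was dominated by COVID discussions.

Limitations

Domain

This model was trained on news data and may not perform well on conversational or informal data.

Noisy training data

The training data was noisy, and no manual cleaning was applied.

Acronyms and abbreviations

Acronyms and abbreviations are especially noisy; the tables below show how many times each variant of a token appears in the training data.

Token	Count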
Mr	115232
Mr.	108212

Token	Count
U.S.	85324
US	37332
U.S	354
U.s	108
u.S.	65

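As a result, the model's acronym and abbreviation predictions may be somewhat unpredictable.

Sentence boundary detection targets

The sentence boundary detection targets assume that each line of the input data is exactly one sentence. However, a non-negligible portion of the training data contains multiple sentences per line. The SBD head may therefore miss an obvious sentence boundary if it resembles an error seen in the training data.

Evaluation

When reading these metrics, keep the following in mind:

The data is noisy.
Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes wrong. When conditioned on reference punctuation, the true-casing and SBD metrics are much higher with respect to the reference targets.
Punctuation can be subjective. For example: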

Hello Frank, how's it going?

or

Hello Frank. How's it going?

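When sentences are longer and more realistic, these ambiguities abound and affect all three sets of metrics.

Test data and example generation

Each test example was generated with the following procedure:

Concatenate 10 random sentences.
Lowercase the concatenated text.
Remove all punctuation.

The data is a held-out portion of the news-crawl data, which has been deduplicated. 3,000 lines of data were used, generating 3,000 unique examples of 10 sentences each.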
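
The three-step procedure above might look roughly like the following sketch. The function name and the toy sentences are hypothetical. One deliberate deviation is labeled here as an assumption: only the three marks the model predicts (".", ",", "?") are stripped, so apostrophes survive as they do in the usage examples ("they're", "it's"); the author's actual script is not shown and may strip more.

```python
import random
from typing import List, Tuple

# Assumption: strip only the punctuation marks the model predicts.
_STRIP = str.maketrans("", "", ".,?")


def make_example(sentences: List[str], rng: random.Random, n: int = 10) -> Tuple[str, str]:
    """Build one (model_input, reference) pair from n randomly chosen sentences."""
    chosen = rng.sample(sentences, n)        # step 1: pick and concatenate
    reference = " ".join(chosen)
    lowered = reference.lower()              # step 2: lowercase
    stripped = lowered.translate(_STRIP)     # step 3: remove punctuation
    model_input = " ".join(stripped.split()) # normalize whitespace
    return model_input, reference


# Toy corpus standing in for the held-out news-crawl lines.
corpus = [f"Sentence number {i}, from the U.S. corpus?" for i in range(20)]
model_input, reference = make_example(corpus, random.Random(0))
```

The reference string keeps the original punctuation and casing, so the same pair serves as target for all three tasks.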

Results

Punctuation report

    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.83      98.49      98.66     446496
    <ACRONYM> (label_id: 1)                                 74.15      94.26      83.01        697
    . (label_id: 2)                                         90.64      92.99      91.80      30002
    , (label_id: 3)                                         77.19      79.13      78.15      23321
    ? (label_id: 4)                                         76.58      74.56      75.56       1022
    -------------------
    micro avg                                               97.21      97.21      97.21     501538
    macro avg                                               83.48      87.89      85.44     501538
    weighted avg                                            97.25      97.21      97.23     501538
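
For readers unfamiliar with the summary rows, the sketch below recomputes the macro and weighted averages from the per-label rows of the punctuation report. The helper names are hypothetical; the numbers are copied from the table above. The macro average is the unweighted mean over labels, and the weighted average weights each label by its support.

```python
from typing import List, Tuple

# (label, precision, recall, f1, support), copied from the punctuation report.
ROWS: List[Tuple[str, float, float, float, int]] = [
    ("<NULL>", 98.83, 98.49, 98.66, 446496),
    ("<ACRONYM>", 74.15, 94.26, 83.01, 697),
    (".", 90.64, 92.99, 91.80, 30002),
    (",", 77.19, 79.13, 78.15, 23321),
    ("?", 76.58, 74.56, 75.56, 1022),
]


def macro_avg(values: List[float]) -> float:
    """Unweighted mean over labels: rare labels count as much as common ones."""
    return sum(values) / len(values)


def weighted_avg(values: List[float], supports: List[int]) -> float:
    """Mean over labels weighted by support: dominated by frequent labels."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


precisions = [row[1] for row in ROWS]
recalls = [row[2] for row in ROWS]
supports = [row[4] for row in ROWS]

macro_precision = round(macro_avg(precisions), 2)                  # 83.48
macro_recall = round(macro_avg(recalls), 2)                        # 87.89
weighted_precision = round(weighted_avg(precisions, supports), 2)  # 97.25
```

The gap between the macro and weighted figures reflects how heavily the NULL label dominates the support counts.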

True-casing report

# With predicted punctuation (not aligned with targets)
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.76      99.72      99.74    2020678
    UPPER (label_id: 1)                                     93.32      94.20      93.76      83873
    -------------------
    micro avg                                               99.50      99.50      99.50    2104551
    macro avg                                               96.54      96.96      96.75    2104551
    weighted avg                                            99.50      99.50      99.50    2104551


# With reference punctuation (punctuation matches targets)
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.83      99.81      99.82    2020678
    UPPER (label_id: 1)                                     95.51      95.90      95.71      83873
    -------------------
    micro avg                                               99.66      99.66      99.66    2104551
    macro avg                                               97.67      97.86      97.76    2104551
    weighted avg                                            99.66      99.66      99.66    2104551

Sentence boundary detection report

# With predicted punctuation (not aligned with targets)
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.59      99.45      99.52     471608
    FULLSTOP (label_id: 1)                                  91.47      93.53      92.49      29930
    -------------------
    micro avg                                               99.09      99.09      99.09     501538
    macro avg                                               95.53      96.49      96.00     501538
    weighted avg                                            99.10      99.09      99.10     501538


# With reference punctuation (punctuation matches targets)
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.97      99.98     471608
    FULLSTOP (label_id: 1)                                  99.63      99.93      99.78      32923
    -------------------
    micro avg                                               99.97      99.97      99.97     504531
    macro avg                                               99.81      99.95      99.88     504531
    weighted avg                                            99.97      99.97      99.97     504531

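Fun facts

Embeddings

Let's examine the embeddings (see the graph above) to see whether the model uses them meaningfully.

Here is the cosine similarity between the embeddings of each token: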

	NULL	ACRONYM	.	,	?
NULL	1.00
ACRONYM	-0.49	1.00
.	-1.00	0.48	1.00
,	1.00	-0.48	-1.00	1.00
?	-1.00	0.49	1.00	-1.00	1.00
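
A table like this can be produced with a few lines of code. The sketch below is a minimal pure-Python version; the 2-D vectors are made-up placeholders chosen only to mimic the pattern in the table, not the model's actual embedding weights, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_table(embeddings: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise cosine similarity between all named embeddings."""
    return {
        row: {col: cosine_similarity(vec_r, vec_c) for col, vec_c in embeddings.items()}
        for row, vec_r in embeddings.items()
    }


# Hypothetical toy embeddings (NOT the trained weights).
toy = {
    "NULL": [1.0, 0.0],
    "ACRONYM": [-0.5, 0.87],
    ".": [-1.0, 0.0],
    ",": [2.0, 0.0],
    "?": [-3.0, 0.0],
}
table = similarity_table(toy)
```

Cosine similarity ignores vector length, which is why "," at twice the length of NULL in the toy data still scores 1.00 against it.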

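Recall that these embeddings are used to predict sentence boundaries, so we should expect the full stops to cluster.

Indeed, NULL and "," are exactly the same, because neither has any implication for sentence boundaries.

Next, "." and "?" are exactly the same, because with respect to SBD they play the same role: a strong indication that the sentence ends here. (We might nevertheless expect some difference between these tokens, given that "." is also predicted after abbreviations such as "Mr.", which do not end a sentence.)

Further, "." and "?" are exactly the opposite of NULL. This is expected, since these tokens typically imply a sentence boundary, whereas NULL and "," never do.

Lastly, ACRONYM is similar to, but not the same as, the full stops "." and "?", and far from, but not the opposite of, NULL and ",". Intuition suggests this is because an acronym can end a sentence ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").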