bert-restore-punctuation open-source model: free to deploy, restores punctuation and capitalization in plain lowercase text


Bert Restore Punctuation

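Developed by speeqo

A model based on the bert-base-uncased architecture, fine-tuned on the Yelp reviews dataset for punctuation restoration. It predicts punctuation and capitalization for plain lowercase text.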

Sequence labeling

Transformers

English

License: MIT

#English punctuation restoration #ASR post-processing #Text normalization

Downloads: 19

Released: 3/22/2022

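Model overview

The model restores punctuation and capitalization in English text. It is suited to speech recognition output and other text that has lost its punctuation.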

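Model features

Multiple punctuation marks

Restores the following punctuation marks: ! ? . , - : ; '

Capitalization restoration

Restores the upper-casing of words

Text of arbitrary length

Handles English text of arbitrary length

GPU acceleration

Uses the GPU automatically when one is available to speed up processing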

Model capabilities

Punctuation restoration

Capitalization restoration

Text normalization

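Use cases

ASR post-processing

Normalizing ASR output

Adds punctuation and capitalization to the unpunctuated text produced by speech recognition systems

Improves readability and the quality of downstream processing

Text preprocessing

Restoring text with lost punctuation

Restores text whose punctuation was lost through format conversion or other causes

Helps recover the original structure and meaning of the text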
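Both use cases assume the input is plain lowercase text without punctuation. When the source text still carries stray casing or punctuation, it can be normalized first. The following is a minimal sketch of such a step; the helper name `to_model_input` and the exact cleaning rules are illustrative assumptions, not part of rpunct.

```python
import re


def to_model_input(text: str) -> str:
    """Lowercase text and strip punctuation so it resembles raw ASR output.

    Hypothetical helper: in-word hyphens and apostrophes are kept
    ("state-of-the-art", "it's"); everything else is dropped.
    """
    lowered = text.lower()
    # Replace anything that is not a word character, whitespace, ' or - with a space.
    cleaned = re.sub(r"[^\w\s'-]", " ", lowered)
    # Drop hyphens and apostrophes that do not sit between two word characters.
    cleaned = re.sub(r"(?<!\w)['-]|['-](?!\w)", " ", cleaned)
    # Collapse runs of whitespace.
    return " ".join(cleaned.split())


sample = to_model_input("Hello, World! It's a state-of-the-art test.")
```

The result can then be passed to the punctuation model in place of the original string.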

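🚀 ✨ BERT Restore Punctuation model

This is a bert-base-uncased model fine-tuned for punctuation restoration on Yelp reviews. It predicts the punctuation and upper-casing of plain lowercase text. Typical inputs are the output of automatic speech recognition (ASR) or any other text that has lost its punctuation. The model can be used directly as a punctuation restoration model for general English, or fine-tuned further on domain-specific text for the same task. It restores the following punctuation marks: [! ? . , - : ; ' ]. It also restores the upper-casing of words.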
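Because this is a sequence labeling model, each input word receives one label that combines a punctuation mark with a capitalization flag. The sketch below shows how such per-word labels could be turned back into text. The label strings ("none", "Upper", ".", ".+Upper") are an assumed encoding modeled on the per-label results reported for this model, and `apply_labels` is a hypothetical helper, not the decoding code rpunct actually uses.

```python
def apply_labels(words, labels):
    """Rebuild text from lowercase words and per-word labels.

    Assumed label format: "none", "Upper", a punctuation mark such as ".",
    or a mark followed by "+Upper" such as ".+Upper".
    """
    out = []
    for word, label in zip(words, labels):
        punct, _, case = label.partition("+")
        if punct == "Upper":      # capitalization only, no punctuation
            punct, case = "", "Upper"
        elif punct == "none":     # leave the word unchanged
            punct = ""
        if case == "Upper":
            word = word[:1].upper() + word[1:]
        out.append(word + punct)  # punctuation follows the word
    return " ".join(out)


words = ["in", "2018", "cornell", "researchers", "built", "a", "detector"]
labels = ["Upper", ",", "Upper", "none", "none", "none", "."]
restored = apply_labels(words, labels)
```

In a real pipeline the `labels` list would come from the model's token-level predictions rather than being written by hand.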

🚀 Quick start

The following shows how to start using the model.

📦 Installation

First, install the required package.

pip install rpunct

💻 Usage example

Basic usage

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

The model handles English text of arbitrary length and uses the GPU when one is available.
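BERT-style models accept a bounded number of tokens per forward pass, so long inputs are usually processed in pieces. The sketch below illustrates one common approach, overlapping word windows that are merged afterwards. The function names and the `size` and `overlap` defaults are illustrative assumptions; the source does not describe how rpunct splits text internally.

```python
def split_with_overlap(words, size=250, overlap=30):
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this window already reaches the end
            break
    return chunks


def merge_chunks(chunks, overlap=30):
    """Concatenate windows, dropping the overlapping prefix of all but the first."""
    merged = list(chunks[0]) if chunks else []
    for chunk in chunks[1:]:
        merged.extend(chunk[overlap:])
    return merged


tokens = [f"w{i}" for i in range(10)]
windows = split_with_overlap(tokens, size=4, overlap=1)
roundtrip = merge_chunks(windows, overlap=1)
```

The overlap gives each window some context from its neighbor, which tends to help predictions near window boundaries.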

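📚 Documentation

📡 Training data

The number of product reviews used to fine-tune the model:

Attribute	Details
Language	English
Text samples	560,000

The model converged best at around 3 training epochs. The model presented here is from that epoch and is available for download.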
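Fine-tuning a sequence labeling model of this kind needs pairs of lowercase words and target labels. The sketch below shows one way such pairs could be derived from already punctuated reviews. It is an assumption for illustration only: the source does not describe the authors' preprocessing, and `label_words` and its label strings are hypothetical.

```python
PUNCT = "!?.,-:;'"  # the marks this model is reported to restore


def label_words(text):
    """Turn punctuated text into (lowercase word, label) training pairs."""
    pairs = []
    for token in text.split():
        # Only a single trailing mark is considered in this simplified sketch.
        punct = token[-1] if token[-1] in PUNCT else ""
        core = token[:-1] if punct else token
        if not core:  # token was a lone punctuation mark
            continue
        upper = core[0].isupper()
        if punct and upper:
            label = f"{punct}+Upper"
        elif punct:
            label = punct
        elif upper:
            label = "Upper"
        else:
            label = "none"
        pairs.append((core.lower(), label))
    return pairs


pairs = label_words("Great food, friendly staff. Would return!")
```

Each pair gives the model a lowercase input word and the label it should predict for that position.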

🎯 Accuracy

The fine-tuned model achieved the following accuracy on 45,990 held-out text samples:

Accuracy	Overall F1	Eval support
91%	90%	45,990

Per-label performance breakdown:

Label	Precision	Recall	F1-score	Support
!	0.45	0.17	0.24	424
!+Upper	0.43	0.34	0.38	98
'	0.60	0.27	0.37	11
,	0.59	0.51	0.55	1522
,+Upper	0.52	0.50	0.51	239
-	0.00	0.00	0.00	18
.	0.69	0.84	0.75	2488
.+Upper	0.65	0.52	0.57	274
:	0.52	0.31	0.39	39
:+Upper	0.36	0.62	0.45	16
;	0.00	0.00	0.00	17
?	0.54	0.48	0.51	46
?+Upper	0.40	0.50	0.44	4
none	0.96	0.96	0.96	35352
Upper	0.84	0.82	0.83	5442
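The per-label rows can be related to the overall F1 by averaging. The sketch below recomputes a support-weighted and a macro-averaged F1 from the table values. It assumes the overall figure is a support-weighted average, which the source does not state; the English label names are used here only as row identifiers.

```python
# (label, precision, recall, f1, support) copied from the per-label table.
rows = [
    ("!", 0.45, 0.17, 0.24, 424),
    ("!+Upper", 0.43, 0.34, 0.38, 98),
    ("'", 0.60, 0.27, 0.37, 11),
    (",", 0.59, 0.51, 0.55, 1522),
    (",+Upper", 0.52, 0.50, 0.51, 239),
    ("-", 0.00, 0.00, 0.00, 18),
    (".", 0.69, 0.84, 0.75, 2488),
    (".+Upper", 0.65, 0.52, 0.57, 274),
    (":", 0.52, 0.31, 0.39, 39),
    (":+Upper", 0.36, 0.62, 0.45, 16),
    (";", 0.00, 0.00, 0.00, 17),
    ("?", 0.54, 0.48, 0.51, 46),
    ("?+Upper", 0.40, 0.50, 0.44, 4),
    ("none", 0.96, 0.96, 0.96, 35352),
    ("Upper", 0.84, 0.82, 0.83, 5442),
]


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(support for *_, support in rows)
weighted_f1 = sum(score * support for _, _, _, score, support in rows) / total_support
macro_f1 = sum(score for _, _, _, score, _ in rows) / len(rows)

print(f"support={total_support} weighted F1={weighted_f1:.3f} macro F1={macro_f1:.3f}")
```

The supports sum to 45,990 and the weighted F1 comes out at about 0.905, in line with the reported overall figure, while the macro average is about 0.46. The overall score is therefore dominated by the "none" and "Upper" labels, and rare marks such as "-" and ";" are effectively not restored.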