Bert Restore Punctuation
Model Overview
This model restores punctuation and capitalization in English text, making it suitable for speech-recognition output or other text that has lost its punctuation. It can restore the following marks: ! ? . , - : ; ' and also capitalizes the first letter of words.
Features
Multi-punctuation restoration
Restores a variety of punctuation marks, including periods, commas, question marks, and exclamation marks.
Case restoration
Automatically restores word-initial capitalization, improving readability.
Long-text handling
Handles English text of arbitrary length, making it suitable for long-form content.
GPU acceleration
Automatically uses GPU acceleration when a GPU is available, speeding up processing.
Capabilities
Punctuation restoration
Case restoration
Text processing
Long-text support
Use Cases
Speech recognition post-processing
Punctuation restoration for ASR output
Restore punctuation and capitalization in the unpunctuated text produced by a speech recognition system, improving its readability and polish. A minimal end-to-end sketch follows this section.
Text preprocessing
Recovering unpunctuated text
Process text whose punctuation was lost in transmission or storage, restoring its original formatting for downstream analysis.
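As a minimal sketch of this degrade-and-restore loop, the snippet below strips punctuation and case from a reference sentence to simulate ASR-style input, then restores them with rpunct. Only RestorePuncts and punctuate come from the documented API; the degradation step is ordinary standard-library code written for this illustration.

import re
from rpunct import RestorePuncts

def simulate_asr(text):
    # Drop the punctuation marks the model can restore, then lowercase.
    return re.sub(r"[!?.,\-:;']", "", text).lower()

reference = "Now, a team led by David Muller has bested its own record."
degraded = simulate_asr(reference)
# -> "now a team led by david muller has bested its own record"

rpunct = RestorePuncts()
print(rpunct.punctuate(degraded))  # ideally close to the reference sentence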
✨ bert-restore-punctuation
This is a BERT-based model fine-tuned to restore punctuation and capitalization in English text. It was trained on the Yelp Reviews dataset and can handle all kinds of unpunctuated English text, such as the output of automatic speech recognition (ASR). The model can be used as-is for punctuation restoration on general English text, or further fine-tuned on domain-specific text.
🚀 Quick Start
First, install the required package:
pip install rpunct
Then run the basic usage example shown under Usage Examples below. The model can process English text of arbitrary length and uses GPU acceleration when a GPU is available.
✨ Key Features
- Predicts punctuation and capitalization for plain lowercased text, making it well suited to speech-recognition output and other text that has lost its punctuation.
- Can be used as-is for punctuation restoration on general English text, or further fine-tuned on domain-specific text; a sketch of loading the underlying checkpoint directly follows this list.
- Restores the following punctuation marks -- [! ? . , - : ; ' ] -- and restores word-initial capitalization.
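For working with the underlying token-classification checkpoint directly, for example as a starting point for domain-specific fine-tuning, a minimal sketch with Hugging Face transformers follows. The repository id felflare/bert-restore-punctuation is an assumption based on this model card's title, and the printed labels are whatever the checkpoint's id2label mapping defines.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "felflare/bert-restore-punctuation"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "in 2018 cornell researchers built a high-powered detector"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label per sub-word token, e.g. "none", "Upper", ".+Upper".
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, model.config.id2label[pred])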
📦 Installation
pip install rpunct
💻 Usage Examples
Basic usage
from rpunct import RestorePuncts
# English is the default language
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Output:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B.
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.
📚 Documentation
Training data
The following summarizes the product-review data used to fine-tune the model:
Property | Details |
---|---|
Model type | Fine-tuned bert-base-uncased |
Training data | Yelp Reviews, 560,000 English text samples |
We found that the model reached its best convergence at roughly 3 training epochs.
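The card does not ship training code, but the label scheme in the results table below matches a per-word, NER-style fine-tune. A hypothetical setup using the simpletransformers NER interface is sketched here; the label list is inferred from the results table, and train_df is a stand-in for the real Yelp-derived data (one row per word, with columns sentence_id, words, labels).

import pandas as pd
from simpletransformers.ner import NERModel

# Label set inferred from the per-label results table below.
labels = ["none", "Upper", "!", "!+Upper", "'", ",", ",+Upper", "-",
          ".", ".+Upper", ":", ":+Upper", ";", "?", "?+Upper"]

# Stand-in training frame: each word is tagged with the punctuation/case
# label that should be applied to it.
train_df = pd.DataFrame(
    [(0, "hello", ",+Upper"), (0, "world", ".")],
    columns=["sentence_id", "words", "labels"],
)

model = NERModel(
    "bert", "bert-base-uncased",
    labels=labels,
    args={"num_train_epochs": 3},  # the card reports best convergence at ~3 epochs
    use_cuda=False,                # set True when a GPU is available
)
model.train_model(train_df)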
Accuracy
The fine-tuned model's accuracy on 45,990 held-out text samples is as follows:
Accuracy | Overall F1 | Evaluation samples |
---|---|---|
91% | 90% | 45,990 |
The model's performance breaks down per label as follows:
Label | Precision | Recall | F1 | Support |
---|---|---|---|---|
! | 0.45 | 0.17 | 0.24 | 424 |
!+Upper | 0.43 | 0.34 | 0.38 | 98 |
' | 0.60 | 0.27 | 0.37 | 11 |
, | 0.59 | 0.51 | 0.55 | 1522 |
,+Upper | 0.52 | 0.50 | 0.51 | 239 |
- | 0.00 | 0.00 | 0.00 | 18 |
. | 0.69 | 0.84 | 0.75 | 2488 |
.+Upper | 0.65 | 0.52 | 0.57 | 274 |
: | 0.52 | 0.31 | 0.39 | 39 |
:+Upper | 0.36 | 0.62 | 0.45 | 16 |
; | 0.00 | 0.00 | 0.00 | 17 |
? | 0.54 | 0.48 | 0.51 | 46 |
?+Upper | 0.40 | 0.50 | 0.44 | 4 |
none | 0.96 | 0.96 | 0.96 | 35352 |
Upper | 0.84 | 0.82 | 0.83 | 5442 |
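As a quick sanity check, the reported overall F1 of 90% is consistent with the support-weighted average of the per-label F1 scores; the snippet below reproduces that arithmetic directly from the table.

# (F1, support) pairs copied from the per-label table above.
per_label = [
    (0.24, 424), (0.38, 98), (0.37, 11), (0.55, 1522), (0.51, 239),
    (0.00, 18), (0.75, 2488), (0.57, 274), (0.39, 39), (0.45, 16),
    (0.00, 17), (0.51, 46), (0.44, 4), (0.96, 35352), (0.83, 5442),
]
total = sum(support for _, support in per_label)              # 45,990 samples
weighted_f1 = sum(f1 * support for f1, support in per_label) / total
print(f"{weighted_f1:.2f}")                                   # ~0.90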
☕ Contact
For questions, feedback, or requests for a similar model, contact Daulet Nurmanbetov.
📄 License
This project is released under the MIT License.