xlm-roberta_punctuation_fullstop_truecase open-source model: punctuation restoration, true-casing, and sentence boundary detection for 47 languages

Xlm Roberta Punctuation Fullstop Truecase

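Developed by 1-800-BAD-CODE

A model fine-tuned from xlm-roberta that restores punctuation, applies true-casing, and detects sentence boundaries in 47 languages.

Sequence labeling | Multilingual | License: Apache-2.0 | #MultilingualPunctuationRestoration #Truecasing #SentenceBoundaryDetection

Downloads: 60.17k

Released: 5/7/2023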

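Model Overview

The model automatically adds punctuation, corrects capitalization (true-casing), and detects sentence boundaries (full stops) in text. It supports 47 languages.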

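Model Features

Multilingual support

Restores punctuation and corrects capitalization in 47 languages.

Sentence boundary detection

Detects sentence boundaries and inserts the appropriate full stop.

True-casing

Corrects the capitalization of text, especially initial capitals.

Efficient inference

Distributed as an ONNX graph, which supports efficient inference.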

Model Capabilities

Punctuation restoration

Sentence boundary detection

True-casing

Multilingual text processing

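Use Cases

Text processing

Automatic punctuation

Adds appropriate punctuation to unpunctuated text.

Improves the readability and consistency of the text.

Sentence segmentation

Splits running text into individual sentences.

Simplifies downstream text analysis and processing.

True-casing

Capitalizes initial letters so the text looks more standard.

Brings the text in line with writing conventions.

Multilingual applications

Multilingual text processing

Processes text in 47 languages, which suits internationalized applications.

Covers text-processing needs across different languages.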

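🚀 XLM-RoBERTa Multilingual Punctuation Restoration Model

This is a fine-tuned xlm-roberta model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.

🚀 Quick Start

If you just want to try the model, the widget on this page is enough. To use the model offline, the snippets below show how to run it both with a wrapper (which I wrote; it is available from PyPI) and manually.

✨ Key Features

Multilingual support: covers 47 languages, including English, Spanish, Chinese, Japanese, and Arabic.
Multiple tasks: restores punctuation, true-cases, and detects sentence boundaries.
No language-specific paths: the same graph runs on every language, with no per-language handling.

📦 Installation

The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):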

$ pip install punctuators

💻 Usage Examples

Basic usage

from typing import List

from punctuators.models import PunctCapSegModelONNX

m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained(
    "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
)

input_texts: List[str] = [
    "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad",
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭",
    "በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል",
    "こんにちは友人" "調子はどう" "今日は雨の日でしたね" "乾いた状態を保つために一日中室内で過ごしました",
    "hallo freund wie geht's es war heute ein regnerischer tag nicht wahr ich verbrachte den tag drinnen um trocken zu bleiben",
    "हैलो दोस्त ये कैसा चल रहा है आज बारिश का दिन था न मैंने सूखा रहने के लिए दिन घर के अंदर बिताया",
    "كيف تجري الامور كان يومًا ممطرًا اليوم أليس كذلك قضيت اليوم في الداخل لأظل جافًا",
]

results: List[List[str]] = m.infer(
    texts=input_texts, apply_sbd=True,
)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

Advanced usage

An example of using the ONNX and SentencePiece (SP) models manually:

from typing import List

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
from sentencepiece import SentencePieceProcessor

# Download the models from HF hub. Note: to clean up, you can find these files in your HF cache directory
spe_path = hf_hub_download(repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="sp.model")
onnx_path = hf_hub_download(repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="model.onnx")
config_path = hf_hub_download(
    repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="config.yaml"
)

# Load the SP model
tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path)  # noqa
# Load the ONNX graph
ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
# Load the model config with labels, etc.
config = OmegaConf.load(config_path)
# Potential classification labels before each subtoken
pre_labels: List[str] = config.pre_labels
# Potential classification labels after each subtoken
post_labels: List[str] = config.post_labels
# Special class that means "predict nothing"
null_token = config.get("null_token", "<NULL>")
# Special class that means "all chars in this subtoken end with a period", e.g., "am" -> "a.m."
acronym_token = config.get("acronym_token", "<ACRONYM>")
# Not used in this example, but if your sequence exceeds this value, you need to fold it over multiple inputs
max_len = config.max_length
# For reference only, graph has no language-specific behavior
languages: List[str] = config.languages

# Encode some input text, adding BOS + EOS
input_text = "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad"
input_ids = [tokenizer.bos_id()] + tokenizer.EncodeAsIds(input_text) + [tokenizer.eos_id()]

# Create a numpy array with shape [B, T], as the graph expects as input.
# Note that we do not pass lengths to the graph; if you are using a batch, padding should be tokenizer.pad_id() and the
# graph's attention mechanisms will ignore pad_id() without requiring explicit sequence lengths.
input_ids_arr: np.ndarray = np.array([input_ids], dtype=np.int64)

# Run the graph, get outputs for all analytics
pre_preds, post_preds, cap_preds, sbd_preds = ort_session.run(None, {"input_ids": input_ids_arr})
# Squeeze off the batch dimensions and convert to lists
pre_preds = pre_preds[0].tolist()
post_preds = post_preds[0].tolist()
cap_preds = cap_preds[0].tolist()
sbd_preds = sbd_preds[0].tolist()

# Segmented sentences
output_texts: List[str] = []
# Current sentence, which is built until we hit a sentence boundary prediction
current_chars: List[str] = []
# Iterate over the outputs, ignoring the first (BOS) and final (EOS) predictions and tokens
for token_idx in range(1, len(input_ids) - 1):
    token = tokenizer.IdToPiece(input_ids[token_idx])
    # Simple SP decoding
    if token.startswith("▁") and current_chars:
        current_chars.append(" ")
    # Token-level predictions
    pre_label = pre_labels[pre_preds[token_idx]]
    post_label = post_labels[post_preds[token_idx]]
    # If we predict "pre-punct", insert it before this token
    if pre_label != null_token:
        current_chars.append(pre_label)
    # Iterate over each char, skipping SP's leading space marker
    char_start = 1 if token.startswith("▁") else 0
    for token_char_idx, char in enumerate(token[char_start:], start=char_start):
        # If this char should be capitalized, apply upper case
        if cap_preds[token_idx][token_char_idx]:
            char = char.upper()
        # Append char
        current_chars.append(char)
        # if this is an acronym, add a period after every char (p.m., a.m., etc.)
        if post_label == acronym_token:
            current_chars.append(".")
    # Maybe this subtoken ends with punctuation
    if post_label != null_token and post_label != acronym_token:
        current_chars.append(post_label)

    # If this token is a sentence boundary, finalize the current sentence and reset
    if sbd_preds[token_idx]:
        output_texts.append("".join(current_chars))
        current_chars.clear()

# Maybe push final sentence, if the final token was not classified as a sentence boundary
if current_chars:
    output_texts.append("".join(current_chars))

# Pretty print
print(f"Input: {input_text}")
print("Outputs:")
for text in output_texts:
    print(f"\t{text}")
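
The comments in the example above mention two steps that it does not implement: folding an input longer than max_length over multiple inputs, and padding a batch with tokenizer.pad_id(). The following is a minimal sketch of both in plain Python. The helper names fold_ids and pad_batch are hypothetical, the IDs are toy values, and the non-overlapping split is a simplifying assumption, not necessarily what the punctuators wrapper does.

```python
from typing import List


def fold_ids(ids: List[int], max_len: int, bos_id: int, eos_id: int) -> List[List[int]]:
    """Split subtoken IDs (without BOS/EOS) into segments that fit max_len once BOS/EOS are added."""
    body_len = max_len - 2  # leave room for BOS and EOS
    if body_len <= 0:
        raise ValueError("max_len must be greater than 2")
    return [[bos_id] + ids[i : i + body_len] + [eos_id] for i in range(0, len(ids), body_len)]


def pad_batch(batch: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every sequence to the longest one so the batch forms a [B, T] rectangle."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Toy IDs for illustration only; real IDs come from tokenizer.EncodeAsIds(...)
segments = fold_ids(list(range(10, 17)), max_len=5, bos_id=0, eos_id=2)
batch = pad_batch(segments, pad_id=1)
for row in batch:
    print(row)
```

The padded rows can then be wrapped in np.array(...) to form the [B, T] input. Tokens near a segment edge see less context, so an overlapping split may work better in practice.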

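📚 Detailed Documentation

Model architecture

This model implements the following graph, which allows punctuation, true-casing, and full-stop prediction in every language without language-specific behavior: ![graph.png](https://cdn-uploads.huggingface.co/production/uploads/62d34c813eebd640a4f97587/WJ8aWIM4A--xzYu8FR4ht.png)

The model works as follows:

Tokenization and encoding: the text is first tokenized and encoded with XLM-RoBERTa, which is the pretrained portion of the graph.
Punctuation prediction: punctuation is predicted before and after every subtoken. Predicting before each subtoken handles the Spanish inverted question mark. Predicting after each subtoken handles all other punctuation, including punctuation in continuous-script languages and acronyms.
Punctuation embeddings: the predicted punctuation tokens are represented with embeddings. These tell the sentence-boundary head which punctuation will be inserted into the text, so that it can predict full stops properly.
Full-stop shift: the full-stop predictions are shifted right by one position to tell the true-casing head where each new sentence begins, since true-casing is strongly tied to sentence boundaries.
True-case prediction: for each subtoken, N predictions are made, where N is the number of characters in the subtoken. Modeling true-casing as a multi-label problem allows arbitrary characters to be capitalized.
Applying predictions: all predictions are applied to the input text to punctuate, true-case, and split sentences in any language.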
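
Two of the steps above, shifting the full-stop predictions and applying per-character true-case predictions, can be sketched in plain Python. This is an illustration on toy data, not the graph's actual implementation: the function names are hypothetical, and filling the first position with 0 after the shift is an assumption.

```python
from typing import List


def shift_right(flags: List[int], fill: int = 0) -> List[int]:
    """Shift per-token full-stop flags one position to the right.

    After the shift, a 1 marks the token that starts a new sentence
    rather than the token that ends the previous one.
    """
    if not flags:
        return []
    return [fill] + flags[:-1]


def apply_truecase(subtoken: str, char_flags: List[int]) -> str:
    """Upper-case each character whose flag is set (one prediction per character)."""
    return "".join(c.upper() if flag else c for c, flag in zip(subtoken, char_flags))


# Toy subtokens; "now" ends the first sentence
tokens = ["it", "snows", "now", "we", "stay", "in"]
full_stop = [0, 0, 1, 0, 0, 0]
print(shift_right(full_stop))  # the flag moves from "now" to "we"

# A multi-label output can capitalize any character, not only the first
print(apply_truecase("mcdonald", [1, 0, 1, 0, 0, 0, 0, 0]))
```

Because each character gets its own prediction, mixed-case words need no special handling beyond the per-character flags.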

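Tokenizer

Instead of relying on the wrapper used by FairSeq, which HuggingFace ported as-is rather than fixed, the xlm-roberta SentencePiece model was adjusted so that it encodes text correctly on its own. Per HuggingFace's comments: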

# Original fairseq vocab and spm vocab must be "aligned":
# Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
# spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'

The SP model was adjusted with the following code:

from sentencepiece import SentencePieceProcessor
from sentencepiece.sentencepiece_model_pb2 import ModelProto

m = ModelProto()
# Read the original SP model inside a context manager so the file handle is closed
with open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb") as f:
    m.ParseFromString(f.read())

pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)

with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())

The SP model can now be used directly, without the wrapper.
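
The effect of this adjustment can be checked on the ten-entry excerpt from the HuggingFace comment above. The sketch below is a toy illustration that uses plain lists rather than a real ModelProto: it replaces the first three SentencePiece entries with the four special tokens and confirms that the result lines up with the fairseq vocabulary.

```python
# First ten entries of each vocabulary, copied from the HuggingFace comment
fairseq_vocab = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]
spm_vocab = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# Mirror the adjustment: four special tokens, then the original pieces from index 3 on
adjusted = ["<s>", "<pad>", "</s>", "<unk>"] + spm_vocab[3:]

# Every regular piece moves up by one ID and now matches its fairseq ID
for piece in spm_vocab[3:9]:
    print(piece, spm_vocab.index(piece), "->", adjusted.index(piece))

print(adjusted[:10] == fairseq_vocab)
```

With the IDs aligned inside the SP model itself, no offset needs to be applied at encoding time.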

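Post-punctuation prediction

The model predicts the following set of punctuation tokens after each subtoken:

Symbol	Description	Relevant languages
<NULL>	No punctuation	All
<ACRONYM>	Every character in this subtoken is followed by a period	Primarily English, some European languages
.	Latin full stop	Many
,	Latin comma	Many
?	Latin question mark	Many
？	Full-width question mark	Chinese, Japanese
，	Full-width comma	Chinese, Japanese
。	Full-width full stop	Chinese, Japanese
、	Ideographic comma	Chinese, Japanese
・	Middle dot	Japanese
।	Danda	Hindi, Bengali, Oriya
؟	Arabic question mark	Arabic
،	Arabic comma	Arabic
;	Greek question mark	Greek
።	Ethiopic full stop	Amharic
፣	Ethiopic comma	Amharic
፧	Ethiopic question mark	Amharic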

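Pre-punctuation prediction

The model predicts the following set of punctuation tokens before each subtoken:

Symbol	Description	Relevant languages
<NULL>	No punctuation	All
¿	Inverted question mark	Spanish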
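
To make the two label sets concrete, the following sketch applies one pre-label and one post-label to a single subtoken, including the <ACRONYM> case. The function name apply_labels is hypothetical and the labels are passed in by hand; in practice they come from the model's predictions, as in the manual ONNX example.

```python
NULL_TOKEN = "<NULL>"
ACRONYM_TOKEN = "<ACRONYM>"


def apply_labels(subtoken: str, pre_label: str, post_label: str) -> str:
    """Attach predicted punctuation to one subtoken (no SentencePiece marker expected)."""
    out = []
    if pre_label != NULL_TOKEN:
        out.append(pre_label)
    if post_label == ACRONYM_TOKEN:
        # A period follows every character, e.g. "am" -> "a.m."
        out.extend(char + "." for char in subtoken)
    else:
        out.append(subtoken)
        if post_label != NULL_TOKEN:
            out.append(post_label)
    return "".join(out)


print(apply_labels("am", NULL_TOKEN, ACRONYM_TOKEN))
print(apply_labels("cómo", "¿", NULL_TOKEN))
print(apply_labels("estás", NULL_TOKEN, "?"))
```

Keeping the pre- and post-labels as separate predictions lets a single subtoken carry both, which a single combined label set could not express without enumerating every pair.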

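Training details

The model was trained in the NeMo framework on an A100 GPU for approximately 7 hours. The tensorboard log can be viewed on tensorboard.dev.

Training used News Crawl data from WMT, with 1M lines of text per language; a few low-resource languages may have used less. Languages were chosen based on whether the author judged the News Crawl corpus to contain enough data of reliable quality.

Limitations

Domain: the model was trained on news data and may not perform well on conversational or informal data.
Production quality: the model is unlikely to be of production quality. It was trained with only 1M lines per language, and the dev sets may be noisy because of the nature of web-scraped news data.
Punctuation errors: the model may over-predict Spanish question marks, especially the inverted question mark ¿, and may also over-predict commas.

If you find limitations not mentioned here, please report them so that they can be addressed in the next fine-tuning.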

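Evaluation

When reading these metrics, keep the following in mind:

Noisy data: the data is noisy.
Conditioning: sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes wrong. When conditioning on reference punctuation, true-casing and sentence-boundary detection are nearly 100% for most languages.
Subjectivity: punctuation can be subjective. For example: Hola mundo, ¿cómo estás? or Hola mundo. ¿Cómo estás? When sentences are longer and more practical, these ambiguities abound and affect all three metrics.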

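Test data and example generation

Each test example was generated with the following procedure:

Concatenate 11 sentences (1 sentence from the test set plus 10 random sentences).
Lower-case the concatenated sentences.
Remove all punctuation.

The targets are generated as the text is lower-cased and the punctuation is removed. The test data is a held-out portion of News Crawl that has been deduplicated. 3,000 lines were used per language, generating 3,000 unique examples of 11 sentences each.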
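
The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the author's data pipeline: the function name make_example, the space-joined concatenation (which suits space-delimited languages only), the placement of the test sentence first, and the punctuation set (taken from the label tables in this card) are all choices made for the sketch. Target-label generation is omitted.

```python
import random
from typing import List

# Punctuation symbols drawn from the pre- and post-label tables; \u037e is the Greek question mark
PUNCTUATION = set(".,?？，。、・।؟،።፣፧¿\u037e")


def make_example(test_sentence: str, pool: List[str], num_extra: int = 10, seed: int = 0) -> str:
    """Build one model input: concatenate, lower-case, then strip punctuation."""
    rng = random.Random(seed)
    sentences = [test_sentence] + rng.sample(pool, num_extra)
    text = " ".join(sentences).lower()
    return "".join(char for char in text if char not in PUNCTUATION)


pool = ["How are you?", "It is snowing outside.", "A large storm is moving in, apparently."]
print(make_example("Hello, friend.", pool, num_extra=2))
```

A full pipeline would record, in the same pass, which characters were upper-case and which punctuation followed each token, so that the removed information becomes the target labels.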

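To measure true-casing and sentence-boundary detection, reference punctuation tokens were used for conditioning (see the model architecture graph above). If predicted punctuation were used instead, incorrect punctuation would misalign the true-casing and sentence-boundary targets, and these metrics would be artificially low.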

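Selected language evaluation reports

For now, metrics for a few selected languages are shown below. Collecting and formatting metrics for all 47 languages takes substantial work, so more reports will be added over time.

English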

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.25      98.43      98.84     564908
    <ACRONYM> (label_id: 1)                                 63.14      84.67      72.33        613
    . (label_id: 2)                                         90.97      93.91      92.42      32040
    , (label_id: 3)                                         73.95      84.32      78.79      24271
    ? (label_id: 4)                                         79.05      81.94      80.47       1041
    ？ (label_id: 5)                                          0.00       0.00       0.00          0
    ， (label_id: 6)                                          0.00       0.00       0.00          0
    。 (label_id: 7)                                          0.00       0.00       0.00          0
    、 (label_id: 8)                                          0.00       0.00       0.00          0
    ・ (label_id: 9)                                          0.00       0.00       0.00          0
    । (label_id: 10)                                         0.00       0.00       0.00          0
    ؟ (label_id: 11)                                         0.00       0.00       0.00          0
    ، (label_id: 12)                                         0.00       0.00       0.00          0
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                         0.00       0.00       0.00          0
    ፣ (label_id: 15)                                         0.00       0.00       0.00          0
    ፧ (label_id: 16)                                         0.00       0.00       0.00          0
    -------------------
    micro avg                                               97.60      97.60      97.60     622873
    macro avg                                               81.27      88.65      84.57     622873
    weighted avg                                            97.77      97.60      97.67     622873

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.72      99.85      99.78    2134956
    UPPER (label_id: 1)                                     96.33      93.52      94.91      91996
    -------------------
    micro avg                                               99.59      99.59      99.59    2226952
    macro avg                                               98.03      96.68      97.34    2226952
    weighted avg                                            99.58      99.59      99.58    2226952

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.98      99.99     591540
    FULLSTOP (label_id: 1)                                  99.61      99.89      99.75      34333
    -------------------
    micro avg                                               99.97      99.97      99.97     625873
    macro avg                                               99.80      99.93      99.87     625873
    weighted avg                                            99.97      99.97      99.97     625873
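
The macro and weighted averages in these reports appear consistent with the usual definitions, computed over labels with non-zero support. The sketch below recomputes the F1 averages for the English punct_post report from its per-label rows. That the reports were produced exactly this way is an assumption, checked only against the rounded numbers shown.

```python
# (f1, support) per label with non-zero support, copied from the English punct_post report
rows = {
    "<NULL>": (98.84, 564908),
    "<ACRONYM>": (72.33, 613),
    ".": (92.42, 32040),
    ",": (78.79, 24271),
    "?": (80.47, 1041),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every label counts equally, so rare labels such as <ACRONYM> pull it down
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each label counts in proportion to its support, so <NULL> dominates
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)
print(round(macro_f1, 2), round(weighted_f1, 2))
```

This is why a macro F1 in the 80s can sit next to a weighted F1 above 97: most positions carry the <NULL> label, which is predicted very accurately.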

Spanish

punct_pre test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.94      99.89      99.92     636941
    ¿ (label_id: 1)                                         56.73      71.35      63.20       1288
    -------------------
    micro avg                                               99.83      99.83      99.83     638229
    macro avg                                               78.34      85.62      81.56     638229
    weighted avg                                            99.85      99.83      99.84     638229

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.19      98.41      98.80     578271
    <ACRONYM> (label_id: 1)                                 30.10      56.36      39.24         55
    . (label_id: 2)                                         91.92      93.12      92.52      30856
    , (label_id: 3)                                         72.98      82.44      77.42      27761
    ? (label_id: 4)                                         52.77      71.85      60.85       1286
    ？ (label_id: 5)                                          0.00       0.00       0.00          0
    ， (label_id: 6)                                          0.00       0.00       0.00          0
    。 (label_id: 7)                                          0.00       0.00       0.00          0
    、 (label_id: 8)                                          0.00       0.00       0.00          0
    ・ (label_id: 9)                                          0.00       0.00       0.00          0
    । (label_id: 10)                                         0.00       0.00       0.00          0
    ؟ (label_id: 11)                                         0.00       0.00       0.00          0
    ، (label_id: 12)                                         0.00       0.00       0.00          0
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                         0.00       0.00       0.00          0
    ፣ (label_id: 15)                                         0.00       0.00       0.00          0
    ፧ (label_id: 16)                                         0.00       0.00       0.00          0
    -------------------
    micro avg                                               97.40      97.40      97.40     638229
    macro avg                                               69.39      80.44      73.77     638229
    weighted avg                                            97.60      97.40      97.48     638229

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.82      99.86      99.84    2324724
    UPPER (label_id: 1)                                     95.92      94.70      95.30      79266
    -------------------
    micro avg                                               99.69      99.69      99.69    2403990
    macro avg                                               97.87      97.28      97.57    2403990
    weighted avg                                            99.69      99.69      99.69    2403990

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.96      99.98     607057
    FULLSTOP (label_id: 1)                                  99.31      99.88      99.60      34172
    -------------------
    micro avg                                               99.96      99.96      99.96     641229
    macro avg                                               99.65      99.92      99.79     641229
    weighted avg                                            99.96      99.96      99.96     641229

Amharic

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.83      99.28      99.56     729664
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00          0
    . (label_id: 2)                                          0.00       0.00       0.00          0
    , (label_id: 3)                                          0.00       0.00       0.00          0
    ? (label_id: 4)                                          0.00       0.00       0.00          0
    ？ (label_id: 5)                                          0.00       0.00       0.00          0
    ， (label_id: 6)                                          0.00       0.00       0.00          0
    。 (label_id: 7)                                          0.00       0.00       0.00          0
    、 (label_id: 8)                                          0.00       0.00       0.00          0
    ・ (label_id: 9)                                          0.00       0.00       0.00          0
    । (label_id: 10)                                         0.00       0.00       0.00          0
    ؟ (label_id: 11)                                         0.00       0.00       0.00          0
    ، (label_id: 12)                                         0.00       0.00       0.00          0
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                        91.27      97.90      94.47      25341
    ፣ (label_id: 15)                                        61.93      82.11      70.60       5818
    ፧ (label_id: 16)                                        67.41      81.73      73.89       1177
    -------------------
    micro avg                                               99.08      99.08      99.08     762000
    macro avg                                               80.11      90.26      84.63     762000
    weighted avg                                            99.21      99.08      99.13     762000

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     98.40      98.03      98.21       1064
    UPPER (label_id: 1)                                     71.23      75.36      73.24         69
    -------------------
    micro avg                                               96.65      96.65      96.65       1133
    macro avg                                               84.81      86.69      85.73       1133
    weighted avg                                            96.74      96.65      96.69       1133

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.85      99.92     743158
    FULLSTOP (label_id: 1)                                  95.20      99.62      97.36      21842
    -------------------
    micro avg                                               99.85      99.85      99.85     765000
    macro avg                                               97.59      99.74      98.64     765000
    weighted avg                                            99.85      99.85      99.85     765000

Chinese

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.53      97.31      98.41     435611
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00          0
    . (label_id: 2)                                          0.00       0.00       0.00          0
    , (label_id: 3)                                          0.00       0.00       0.00          0
    ? (label_id: 4)                                          0.00       0.00       0.00          0
    ？ (label_id: 5)                                         81.85      87.31      84.49       1513
    ， (label_id: 6)                                         74.08      93.67      82.73      35921
    。 (label_id: 7)                                         96.51      96.93      96.72      32097
    、 (label_id: 8)                                          0.00       0.00       0.00          0
    ・ (label_id: 9)                                          0.00       0.00       0.00          0
    । (label_id: 10)                                         0.00       0.00       0.00          0
    ؟ (label_id: 11)                                         0.00       0.00       0.00          0
    ، (label_id: 12)                                         0.00       0.00       0.00          0
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                         0.00       0.00       0.00          0
    ፣ (label_id: 15)                                         0.00       0.00       0.00          0
    ፧ (label_id: 16)                                         0.00       0.00       0.00          0
    -------------------
    micro avg                                               97.00      97.00      97.00     505142
    macro avg                                               87.99      93.81      90.59     505142
    weighted avg                                            97.48      97.00      97.15     505142

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     94.89      94.98      94.94       2951
    UPPER (label_id: 1)                                     81.34      81.03      81.18        796
    -------------------
    micro avg                                               92.02      92.02      92.02       3747
    macro avg                                               88.11      88.01      88.06       3747
    weighted avg                                            92.01      92.02      92.01       3747

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.97      99.98     473642
    FULLSTOP (label_id: 1)                                  99.55      99.90      99.72      34500
    -------------------
    micro avg                                               99.96      99.96      99.96     508142
    macro avg                                               99.77      99.93      99.85     508142
    weighted avg                                            99.96      99.96      99.96     508142

Japanese

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.34      95.90      97.59     406341
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00          0
    . (label_id: 2)                                          0.00       0.00       0.00          0
    , (label_id: 3)                                          0.00       0.00       0.00          0
    ? (label_id: 4)                                          0.00       0.00       0.00          0
    ？ (label_id: 5)                                         70.55      73.56      72.02       1456
    ， (label_id: 6)                                          0.00       0.00       0.00          0
    。 (label_id: 7)                                         94.38      96.95      95.65      32537
    、 (label_id: 8)                                         54.28      87.62      67.03      18610
    ・ (label_id: 9)                                         28.18      71.64      40.45       1100
    । (label_id: 10)                                         0.00       0.00       0.00          0
    ؟ (label_id: 11)                                         0.00       0.00       0.00          0
    ، (label_id: 12)                                         0.00       0.00       0.00          0
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                         0.00       0.00       0.00          0
    ፣ (label_id: 15)                                         0.00       0.00       0.00          0
    ፧ (label_id: 16)                                         0.00       0.00       0.00          0
    -------------------
    micro avg                                               95.51      95.51      95.51     460044
    macro avg                                               69.35      85.13      74.55     460044
    weighted avg                                            96.91      95.51      96.00     460044

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     92.33      94.03      93.18       4174
    UPPER (label_id: 1)                                     83.51      79.46      81.43       1587
    -------------------
    micro avg                                               90.02      90.02      90.02       5761
    macro avg                                               87.92      86.75      87.30       5761
    weighted avg                                            89.90      90.02      89.94       5761

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.92      99.96     428544
    FULLSTOP (label_id: 1)                                  99.07      99.87      99.47      34500
    -------------------
    micro avg                                               99.92      99.92      99.92     463044
    macro avg                                               99.53      99.90      99.71     463044
    weighted avg                                            99.92      99.92      99.92     463044

Hindi

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.75      99.44      99.59     560358
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00          0
    . (label_id: 2)                                          0.00       0.00       0.00          0
    , (label_id: 3)                                         69.55      78.48      73.75       8084
    ? (label_id: 4)                                         63.30      87.07      73.31        317
    ？ (label_id: 5)                                          0.00       0.00       0.00          0
    ， (label_id: 6)                                          0.00       0.00       0.00          0
    。 (label_id: 7)                                          0.00       0.00       0.00          0
    、 (label_id: 8)                                          0.00       0.00       0.00          0
    ・ (label_id: 9)                                          0.00       0.00       0.00          0
    । (label_id: 10)                                        96.92      98.66      97.78      32118
    ؟ (label_id: 11)                                         0.00       0.00       0.00          0
    ، (label_id: 12)                                         0.00       0.00       0.00          0
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                         0.00       0.00       0.00          0
    ፣ (label_id: 15)                                         0.00       0.00       0.00          0
    ፧ (label_id: 16)                                         0.00       0.00       0.00          0
    -------------------
    micro avg                                               99.11      99.11      99.11     600877
    macro avg                                               82.38      90.91      86.11     600877
    weighted avg                                            99.17      99.11      99.13     600877

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     97.19      96.72      96.95       2466
    UPPER (label_id: 1)                                     89.14      90.60      89.86        734
    -------------------
    micro avg                                               95.31      95.31      95.31       3200
    macro avg                                               93.17      93.66      93.41       3200
    weighted avg                                            95.34      95.31      95.33       3200

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.99      99.99     569472
    FULLSTOP (label_id: 1)                                  99.82      99.99      99.91      34405
    -------------------
    micro avg                                               99.99      99.99      99.99     603877
    macro avg                                               99.91      99.99      99.95     603877
    weighted avg                                            99.99      99.99      99.99     603877

Arabic

punct_post test report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.30      96.94      98.10     688043
    <ACRONYM> (label_id: 1)                                 93.33      77.78      84.85         18
    . (label_id: 2)                                         93.31      93.78      93.54      28175
    , (label_id: 3)                                          0.00       0.00       0.00          0
    ? (label_id: 4)                                          0.00       0.00       0.00          0
    ？ (label_id: 5)                                          0.00       0.00       0.00          0
    ， (label_id: 6)                                          0.00       0.00       0.00          0
    。 (label_id: 7)                                          0.00       0.00       0.00          0
    、 (label_id: 8)                                          0.00       0.00       0.00          0
    ・ (label_id: 9)                                          0.00       0.00       0.00          0
    । (label_id: 10)                                         0.00       0.00       0.00          0
    ؟ (label_id: 11)                                        65.93      82.79      73.40        860
    ، (label_id: 12)                                        44.89      79.20      57.30      20941
    ; (label_id: 13)                                         0.00       0.00       0.00          0
    ። (label_id: 14)                                         0.00       0.00       0.00          0
    ፣ (label_id: 15)                                         0.00       0.00       0.00          0
    ፧ (label_id: 16)                                         0.00       0.00       0.00          0
    -------------------
    micro avg                                               96.29      96.29      96.29     738037
    macro avg                                               79.35      86.10      81.44     738037
    weighted avg                                            97.49      96.29      96.74     738037

cap test report: 
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     97.10      99.49      98.28       4137
    UPPER (label_id: 1)                                     98.71      92.89      95.71       1729
    -------------------
    micro avg                                               97.55      97.55      97.55       5866
    macro avg                                               97.90      96.19      96.99       5866
    weighted avg                                            97.57      97.55      97.52       5866

seg test report: 
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.97      99.98     710456
    FULLSTOP (label_id: 1)                                  99.39      99.85      99.62      30581
    -------------------
    micro avg                                               99.97      99.97      99.97     741037
    macro avg                                               99.69      99.91      99.80     741037
    weighted avg                                            99.97      99.97      99.97     741037
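
The averaged rows in these reports can be cross-checked against the per-label rows. The following is a minimal sketch using the Arabic punct_post numbers above. It assumes that the macro average is an unweighted mean over the labels that actually occur in the test set (support > 0) and that the weighted average weights each label by its support. The exclusion of zero-support labels is inferred from the numbers, not stated by the reports, and the helper name `averages` is hypothetical.

```python
from typing import Dict, Tuple

Row = Tuple[float, float, float, int]  # precision, recall, f1, support

# Values copied from the Arabic punct_post report.
# One zero-support label is kept only to show that the filter drops it.
ROWS: Dict[str, Row] = {
    "<NULL>": (99.30, 96.94, 98.10, 688043),
    "<ACRONYM>": (93.33, 77.78, 84.85, 18),
    ".": (93.31, 93.78, 93.54, 28175),
    ",": (0.00, 0.00, 0.00, 0),
    "؟": (65.93, 82.79, 73.40, 860),
    "،": (44.89, 79.20, 57.30, 20941),
}


def averages(rows: Dict[str, Row]):
    """Return (macro, weighted) averages of precision, recall and f1."""
    # Assumption: labels that never occur in the test set are ignored.
    present = [r for r in rows.values() if r[3] > 0]
    total = sum(r[3] for r in present)
    macro = tuple(
        round(sum(r[i] for r in present) / len(present), 2) for i in range(3)
    )
    weighted = tuple(
        round(sum(r[i] * r[3] for r in present) / total, 2) for i in range(3)
    )
    return macro, weighted


macro, weighted = averages(ROWS)
print("macro avg:   ", macro)
print("weighted avg:", weighted)
```

Under this assumption the macro values reproduce the report's macro avg row (79.35, 86.10, 81.44). The low comma precision (44.89) pulls the macro average well below the weighted average, which is dominated by the `<NULL>` label.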

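Special cases

Acronyms, initialisms, and bi-capitalized words

This section briefly shows how the model behaves on the following cases:

Acronyms: e.g. "NATO".
Fake acronyms: e.g. "NHTG" in place of "NATO".
Ambiguous terms that could be either an acronym or a proper noun: e.g. "Tuny".
Bi-capitalized words: e.g. "McDavid".
Initialisms: e.g. "p.m.".

Example inputs for acronyms and related cases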

from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load the pretrained model.
m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained(
    "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
)

# Lowercase, unpunctuated inputs covering the special cases listed above.
input_texts: List[str] = [
    "the us is a nato member as a nato member the country enjoys security guarantees notably article 5",
    "the us is a nhtg member as a nhtg member the country enjoys security guarantees notably article 5",
    "the us is a tuny member as a tuny member the country enjoys security guarantees notably article 5",
    "connor andrew mcdavid is a canadian professional ice hockey centre and captain of the edmonton oilers of the national hockey league the oilers selected him first overall in the 2015 nhl entry draft mcdavid spent his childhood playing ice hockey against older children",
    "please rsvp for the party asap preferably before 8 pm tonight",
]

# With apply_sbd=True, each input yields a list of sentences.
results: List[List[str]] = m.infer(
    texts=input_texts, apply_sbd=True,
)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

Expected output

Input: the us is a nato member as a nato member the country enjoys security guarantees notably article 5
Outputs:
	The U.S. is a NATO member.
	As a NATO member, the country enjoys security guarantees, notably Article 5.

Input: the us is a nhtg member as a nhtg member the country enjoys security guarantees notably article 5
Outputs:
	The U.S. is a NHTG member.
	As a NHTG member, the country enjoys security guarantees, notably Article 5.

Input: the us is a tuny member as a tuny member the country enjoys security guarantees notably article 5
Outputs:
	The U.S. is a Tuny member.
	As a Tuny member, the country enjoys security guarantees, notably Article 5.

Input: connor andrew mcdavid is a canadian professional ice hockey centre and captain of the edmonton oilers of the national hockey league the oilers selected him first overall in the 2015 nhl entry draft mcdavid spent his childhood playing ice hockey against older children
Outputs:
	Connor Andrew McDavid is a Canadian professional ice hockey centre and captain of the Edmonton Oilers of the National Hockey League.
	The Oilers selected him first overall in the 2015 NHL entry draft.
	McDavid spent his childhood playing ice hockey against older children.

Input: please rsvp for the party asap preferably before 8 pm tonight
Outputs:
	Please RSVP for the party ASAP, preferably before 8 p.m. tonight.
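
To see at a glance which words the model changed, the input can be aligned with the output word by word. The sketch below is a hypothetical helper, not part of `punctuators`. It assumes the output keeps exactly one whitespace-delimited token per input word, which holds for the English examples above but does not apply to languages written without spaces.

```python
from typing import List, Tuple


def changed_tokens(
    input_text: str, output_sentences: List[str]
) -> List[Tuple[str, str]]:
    """Pair each input word with its restored form; keep pairs that differ."""
    source = input_text.split()
    restored = " ".join(output_sentences).split()
    # Assumption: punctuation and casing are applied in place, so the
    # number of whitespace-delimited tokens does not change.
    if len(source) != len(restored):
        raise ValueError("token counts differ; whitespace alignment does not apply")
    return [(s, r) for s, r in zip(source, restored) if s != r]


pairs = changed_tokens(
    "please rsvp for the party asap preferably before 8 pm tonight",
    ["Please RSVP for the party ASAP, preferably before 8 p.m. tonight."],
)
for before, after in pairs:
    print(f"{before} -> {after}")
```

For this input the helper reports five changed words, including `rsvp -> RSVP` and `pm -> p.m.`, which makes the acronym and initialism handling easy to inspect across many inputs.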