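🚀 XLM-RoBERTa Multilingual Punctuation Restoration Model
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.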
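🚀 Quick Start
If you just want to try the model, the widget on this page will suffice. To use the model offline, the following snippets show how to run it both with a wrapper (which I wrote; it is available from PyPI) and manually.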
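✨ Key Features
- Multilingual: supports 47 languages, including English, Spanish, Chinese, Japanese, and Arabic.
- Three tasks in one model: restores punctuation, true-cases, and detects sentence boundaries.
- No language-specific paths: the same graph runs on every language, with no special handling per language.
📦 Installation
The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):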
$ pip install punctuators
💻 Usage Examples
Basic Usage
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Download (on first use) and load the ONNX and SentencePiece models
m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained(
    "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
)

# Unpunctuated, lower-cased inputs in several languages
input_texts: List[str] = [
    "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad",
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭",
    "በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል",
    "こんにちは友人" "調子はどう" "今日は雨の日でしたね" "乾いた状態を保つために一日中室内で過ごしました",
    "hallo freund wie geht's es war heute ein regnerischer tag nicht wahr ich verbrachte den tag drinnen um trocken zu bleiben",
    "हैलो दोस्त ये कैसा चल रहा है आज बारिश का दिन था न मैंने सूखा रहने के लिए दिन घर के अंदर बिताया",
    "كيف تجري الامور كان يومًا ممطرًا اليوم أليس كذلك قضيت اليوم في الداخل لأظل جافًا",
]

# With apply_sbd=True, each input yields a list of segmented sentences
results: List[List[str]] = m.infer(
    texts=input_texts, apply_sbd=True,
)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
Advanced Usage
An example of using the ONNX and SP models manually, without the wrapper:
from typing import List
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
from sentencepiece import SentencePieceProcessor
# Download the models from HF hub. Note: to clean up, you can find these files in your HF cache directory
spe_path = hf_hub_download(repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="sp.model")
onnx_path = hf_hub_download(repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="model.onnx")
config_path = hf_hub_download(
    repo_id="1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase", filename="config.yaml"
)
# Load the SP model
tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path) # noqa
# Load the ONNX graph
ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
# Load the model config with labels, etc.
config = OmegaConf.load(config_path)
# Potential classification labels before each subtoken
pre_labels: List[str] = config.pre_labels
# Potential classification labels after each subtoken
post_labels: List[str] = config.post_labels
# Special class that means "predict nothing"
null_token = config.get("null_token", "<NULL>")
# Special class that means "all chars in this subtoken end with a period", e.g., "am" -> "a.m."
acronym_token = config.get("acronym_token", "<ACRONYM>")
# Not used in this example, but if your sequence exceeds this value, you need to fold it over multiple inputs
max_len = config.max_length
# For reference only, graph has no language-specific behavior
languages: List[str] = config.languages
# Encode some input text, adding BOS + EOS
input_text = "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad"
input_ids = [tokenizer.bos_id()] + tokenizer.EncodeAsIds(input_text) + [tokenizer.eos_id()]
# Create a numpy array with shape [B, T], as the graph expects as input.
# Note that we do not pass lengths to the graph; if you are using a batch, padding should be tokenizer.pad_id() and the
# graph's attention mechanisms will ignore pad_id() without requiring explicit sequence lengths.
input_ids_arr: np.ndarray = np.array([input_ids])
# Run the graph, get outputs for all analytics
pre_preds, post_preds, cap_preds, sbd_preds = ort_session.run(None, {"input_ids": input_ids_arr})
# Squeeze off the batch dimensions and convert to lists
pre_preds = pre_preds[0].tolist()
post_preds = post_preds[0].tolist()
cap_preds = cap_preds[0].tolist()
sbd_preds = sbd_preds[0].tolist()
# Segmented sentences
output_texts: List[str] = []
# Current sentence, which is built until we hit a sentence boundary prediction
current_chars: List[str] = []
# Iterate over the outputs, ignoring the first (BOS) and final (EOS) predictions and tokens
for token_idx in range(1, len(input_ids) - 1):
    token = tokenizer.IdToPiece(input_ids[token_idx])
    # Simple SP decoding
    if token.startswith("▁") and current_chars:
        current_chars.append(" ")
    # Token-level predictions
    pre_label = pre_labels[pre_preds[token_idx]]
    post_label = post_labels[post_preds[token_idx]]
    # If we predict "pre-punct", insert it before this token
    if pre_label != null_token:
        current_chars.append(pre_label)
    # Iterate over each char. Skip SP's space token.
    char_start = 1 if token.startswith("▁") else 0
    for token_char_idx, char in enumerate(token[char_start:], start=char_start):
        # If this char should be capitalized, apply upper case
        if cap_preds[token_idx][token_char_idx]:
            char = char.upper()
        # Append char
        current_chars.append(char)
        # If this is an acronym, add a period after every char (p.m., a.m., etc.)
        if post_label == acronym_token:
            current_chars.append(".")
    # Maybe this subtoken ends with punctuation
    if post_label != null_token and post_label != acronym_token:
        current_chars.append(post_label)
    # If this token is a sentence boundary, finalize the current sentence and reset
    if sbd_preds[token_idx]:
        output_texts.append("".join(current_chars))
        current_chars.clear()
# Maybe push final sentence, if the final token was not classified as a sentence boundary
if current_chars:
    output_texts.append("".join(current_chars))
# Pretty print
print(f"Input: {input_text}")
print("Outputs:")
for text in output_texts:
    print(f"\t{text}")
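The manual example runs a single short input. Its comments mention two cases it does not show: folding a sequence longer than `max_length` over multiple inputs, and padding a batch with `pad_id()`. The following is a minimal sketch of both, on plain lists of token IDs. The helper names `fold_ids` and `pad_batch` are hypothetical, the ID values 0/1/2 for BOS/PAD/EOS are placeholders (in practice, use `tokenizer.bos_id()`, `tokenizer.pad_id()`, and `tokenizer.eos_id()`), and non-overlapping windows are a simplifying assumption rather than the wrapper's actual strategy.

```python
from typing import List


def fold_ids(ids: List[int], max_len: int, bos_id: int, eos_id: int) -> List[List[int]]:
    """Split token IDs (without BOS/EOS) into windows of at most max_len, each wrapped in BOS/EOS."""
    body_len = max_len - 2  # leave room for BOS and EOS in every window
    if body_len <= 0:
        raise ValueError("max_len must be greater than 2")
    return [
        [bos_id] + ids[start : start + body_len] + [eos_id]
        for start in range(0, len(ids), body_len)
    ]


def pad_batch(segments: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every segment with pad_id so that all rows share one length (shape [B, T])."""
    longest = max(len(seg) for seg in segments)
    return [seg + [pad_id] * (longest - len(seg)) for seg in segments]


# Placeholder IDs for illustration only
BOS, PAD, EOS = 0, 1, 2
token_ids = list(range(10, 21))  # 11 fake subtoken IDs
segments = fold_ids(token_ids, max_len=6, bos_id=BOS, eos_id=EOS)
batch = pad_batch(segments, pad_id=PAD)
for row in batch:
    print(row)
```

The padded rows can then be passed to the graph as `np.array(batch)`. Because windows are cut at arbitrary positions, predictions near a cut see less context; overlapping windows would be one way to reduce that effect.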
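📚 Documentation
Model Architecture
The model's architecture supports punctuation restoration, true-casing, and sentence boundary (full stop) prediction in every language without language-specific handling.
The model works as follows:
- Tokenization and encoding: The text is tokenized and encoded with XLM-RoBERTa, which is the pre-trained portion of the model.
- Punctuation prediction: Punctuation is predicted before and after every subtoken. Predicting before each subtoken handles the Spanish inverted question mark; predicting after each subtoken handles all other punctuation, including punctuation within continuous-script languages and acronyms.
- Embedding of predictions: The predicted punctuation tokens are represented with embeddings, which tell the sentence boundary head which punctuation will be inserted into the text, so that full stops are predicted correctly.
- Shifting full-stop predictions: The full-stop predictions are shifted right by one position to tell the true-casing head where each new sentence begins, since true-casing is strongly tied to sentence boundaries.
- True-casing prediction: For each subtoken, N predictions are made, where N is the number of characters in the subtoken. True-casing is thus modeled as a multi-label problem, which allows any combination of characters to be upper-cased.
- Applying predictions: Applying all predictions to the input text punctuates, true-cases, and segments text in any language.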
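The right-shift of the full-stop predictions can be illustrated outside the network. The sketch below is a toy on plain Python lists, not the model's implementation: `shift_right` is a hypothetical helper, the boundary flags are made up, and setting the first position to 1 (the first token always starts a sentence) is an assumption of this toy.

```python
from typing import List


def shift_right(boundaries: List[int]) -> List[int]:
    """Shift per-token sentence-boundary flags right by one position.

    Before the shift, a 1 at position i means "a sentence ends after token i".
    After the shift, a 1 at position i means "token i starts a new sentence".
    """
    if not boundaries:
        return []
    # Assumption of this toy: the first token always starts a sentence
    return [1] + boundaries[:-1]


tokens = ["hello", "world", "how", "are", "you"]
ends_sentence = [0, 1, 0, 0, 1]  # hypothetical full-stop predictions
starts_sentence = shift_right(ends_sentence)

# A casing step that sees the shifted flags knows which tokens begin a sentence
cased = [tok.capitalize() if start else tok for tok, start in zip(tokens, starts_sentence)]
print(starts_sentence)
print(cased)
```

In this toy, the shifted flags mark "hello" and "how" as sentence starts, which is the signal the true-casing step needs to capitalize them.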
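Tokenizer
This model adjusts the `xlm-roberta` SentencePiece model so that it encodes text correctly by itself, rather than relying on the wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace. Per HuggingFace's comments: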
# Original fairseq vocab and spm vocab must be "aligned":
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq | '<s>' | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's' | '▁de' | '-'
# spm | '<unk>' | '<s>' | '</s>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a'
The SP model was adjusted with the following snippet:
from sentencepiece import SentencePieceProcessor
from sentencepiece.sentencepiece_model_pb2 import ModelProto
m = ModelProto()
m.ParseFromString(open("/path/to/xlmroberta/sentencepiece.bpe.model", "rb").read())
pieces = list(m.pieces)
pieces = (
    [
        ModelProto.SentencePiece(piece="<s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<pad>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="</s>", type=ModelProto.SentencePiece.Type.CONTROL),
        ModelProto.SentencePiece(piece="<unk>", type=ModelProto.SentencePiece.Type.UNKNOWN),
    ]
    + pieces[3:]
    + [ModelProto.SentencePiece(piece="<mask>", type=ModelProto.SentencePiece.Type.USER_DEFINED)]
)
del m.pieces[:]
m.pieces.extend(pieces)
with open("/path/to/new/sp.model", "wb") as f:
    f.write(m.SerializeToString())
The SP model can now be used directly, without a wrapper.
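The effect of this rearrangement can be checked on the ten entries listed in the HuggingFace comment, without loading any model file. This is a sketch on plain strings; it mirrors only the list manipulation and omits the trailing `<mask>` piece.

```python
# First ten entries of each vocabulary, copied from the HuggingFace comment above
fairseq = ["<s>", "<pad>", "</s>", "<unk>", ",", ".", "▁", "s", "▁de", "-"]
spm = ["<unk>", "<s>", "</s>", ",", ".", "▁", "s", "▁de", "-", "▁a"]

# Same rearrangement as the snippet: four special pieces, followed by
# everything after the original three special pieces
adjusted = ["<s>", "<pad>", "</s>", "<unk>"] + spm[3:]

# Compare position by position against the fairseq ordering
for idx, (expected, actual) in enumerate(zip(fairseq, adjusted)):
    status = "ok" if expected == actual else "MISMATCH"
    print(idx, repr(expected), repr(actual), status)
```

Every position matches, which is why the adjusted SP model produces IDs that line up with the fairseq vocabulary without an offset applied by a wrapper.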
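Post-Punctuation Tokens
The model predicts the following set of punctuation tokens after each subtoken:
Token | Description | Relevant Languages |
---|---|---|
<NULL> | No punctuation | All |
<ACRONYM> | Every character in this subtoken is followed by a period | Primarily English, some European languages |
. | Latin full stop | Many |
, | Latin comma | Many |
? | Latin question mark | Many |
? | Full-width question mark | Chinese, Japanese |
, | Full-width comma | Chinese, Japanese |
。 | Full-width full stop | Chinese, Japanese |
、 | Ideographic comma | Chinese, Japanese |
・ | Middle dot | Japanese |
। | Danda | Hindi, Bengali, Oriya |
؟ | Arabic question mark | Arabic |
، | Arabic comma | Arabic |
; | Greek question mark | Greek |
። | Ethiopic full stop | Amharic |
፣ | Ethiopic comma | Amharic |
፧ | Ethiopic question mark | Amharic |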
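As a small illustration of how these labels are applied to a subtoken, the sketch below isolates the post-punctuation logic from the manual usage example. The helper name `apply_post_label` is hypothetical, and the default token strings assume the config uses `<NULL>` and `<ACRONYM>`.

```python
def apply_post_label(
    subtoken: str,
    label: str,
    null_token: str = "<NULL>",
    acronym_token: str = "<ACRONYM>",
) -> str:
    """Return the subtoken text with its predicted post-punctuation applied."""
    if label == null_token:
        # Nothing to insert
        return subtoken
    if label == acronym_token:
        # A period follows every character, e.g. "pm" -> "p.m."
        return "".join(ch + "." for ch in subtoken)
    # Any other label is a literal punctuation mark appended to the subtoken
    return subtoken + label


print(apply_post_label("pm", "<ACRONYM>"))
print(apply_post_label("estás", "?"))
print(apply_post_label("mundo", "<NULL>"))
```

Treating `<ACRONYM>` as one label per subtoken, rather than one period per character, keeps the label set small while still covering forms such as "a.m." and "U.S.".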
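Pre-Punctuation Tokens
The model predicts the following set of punctuation tokens before each subtoken:
Token | Description | Relevant Languages |
---|---|---|
<NULL> | No punctuation | All |
¿ | Inverted question mark | Spanish |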
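Training Details
This model was trained in the NeMo framework on an A100 GPU for approximately 7 hours. The TensorBoard logs can be viewed on tensorboard.dev.
Training used News Crawl data from WMT, with 1 million lines of text per language, although a few low-resource languages may have used less. Languages were chosen based on the author's judgment of whether the News Crawl corpus contained enough data of reliable quality.
Limitations
- Domain: the model was trained on news data and may not perform well on conversational or informal data.
- Production quality: the model is unlikely to be of production quality, since it was trained on "only" 1 million lines per language, and the dev sets may be noisy due to the nature of web-scraped news data.
- Punctuation over-prediction: the model may over-predict Spanish question marks, especially the inverted question mark ¿; it may also over-predict commas.
If you find other limitations not mentioned here, please let me know so that they can be addressed in the next fine-tuning.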
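Evaluation
Keep the following in mind when reading the evaluation metrics:
- Noisy data: the data is noisy.
- Conditional dependence: sentence boundary detection and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes wrong. When conditioned on reference punctuation, true-casing and sentence boundary detection are nearly 100% for most languages.
- Subjectivity of punctuation: punctuation can be subjective. For example:
Hola mundo, ¿cómo estás?
or
Hola mundo. ¿Cómo estás?
When sentences are longer and more realistic, these ambiguities abound and affect all three metrics.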
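To see how this ambiguity propagates beyond punctuation, the sketch below derives per-character capitalization targets from the two Spanish references. The helper `cap_targets` is hypothetical and is not the evaluation code; it only shows that two acceptable references for the same input disagree on a casing target.

```python
import string
from typing import List


def cap_targets(reference: str) -> List[int]:
    """Return 1 for every upper-case letter and 0 otherwise, skipping punctuation and spaces."""
    skip = set(string.punctuation) | {"¿", "¡", " "}
    return [int(ch.isupper()) for ch in reference if ch not in skip]


ref_a = "Hola mundo, ¿cómo estás?"
ref_b = "Hola mundo. ¿Cómo estás?"
targets_a = cap_targets(ref_a)
targets_b = cap_targets(ref_b)

# Character positions (in the unpunctuated text) where the two references disagree
diff = [i for i, (a, b) in enumerate(zip(targets_a, targets_b)) if a != b]
print(diff)  # [9], the "c" of "cómo"
```

A model that chooses the comma is penalized on punctuation, on the sentence boundary, and on the casing of "cómo" when the reference uses the period, even though both outputs are acceptable.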
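Test Data and Example Generation
Each test example was generated with the following procedure:
- Concatenate 11 random sentences (1 sentence from the test set + 10 random sentences).
- Lower-case the concatenated sentence.
- Remove all punctuation.
The targets are generated as the text is lower-cased and the punctuation is removed. The test data is a held-out portion of News Crawl that has been deduplicated. 3,000 lines of data per language were used, generating 3,000 unique examples of 11 sentences each.
To measure true-casing and sentence boundary detection, reference punctuation tokens were used for conditioning (see the Model Architecture section above). If predicted punctuation were used instead, incorrect punctuation would misalign the true-casing and sentence boundary targets, and these metrics would be artificially low.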
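The generation procedure can be sketched as follows. This is an illustration under stated assumptions, not the author's script: `make_example` is a hypothetical helper, the test sentence is placed first, sentences are joined with a space (which would not suit continuous-script languages), and only the punctuation marks the model predicts are removed.

```python
import random
from typing import List, Tuple

# Assumption: strip only the marks listed in the pre- and post-punctuation tables
PUNCT = set(".,??,。、・।؟،\u037e።፣፧¿")


def make_example(
    test_sentence: str, pool: List[str], rng: random.Random, num_extra: int = 10
) -> Tuple[str, str]:
    """Return (model_input, reference) for one test example."""
    # 1 sentence from the test set + num_extra random sentences
    sentences = [test_sentence] + rng.sample(pool, num_extra)
    reference = " ".join(sentences)
    # Lower-case, then remove punctuation; the reference keeps the targets
    lowered = reference.lower()
    model_input = "".join(ch for ch in lowered if ch not in PUNCT)
    return model_input, reference


pool = [f"Sentence number {i}." for i in range(20)]
model_input, reference = make_example("Hello, world. How are you?", pool, random.Random(0))
print(model_input)
print(reference)
```

Comparing `reference` with `model_input` character by character yields the punctuation, casing, and sentence boundary targets.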
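Selected Language Evaluation Reports
For now, metrics for a few selected languages are reported below. Collecting and formatting metrics for all 47 languages takes substantial work, so reports for more languages will be added over time.
English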
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.25 98.43 98.84 564908
<ACRONYM> (label_id: 1) 63.14 84.67 72.33 613
. (label_id: 2) 90.97 93.91 92.42 32040
, (label_id: 3) 73.95 84.32 78.79 24271
? (label_id: 4) 79.05 81.94 80.47 1041
? (label_id: 5) 0.00 0.00 0.00 0
, (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 97.60 97.60 97.60 622873
macro avg 81.27 88.65 84.57 622873
weighted avg 97.77 97.60 97.67 622873
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 99.72 99.85 99.78 2134956
UPPER (label_id: 1) 96.33 93.52 94.91 91996
-------------------
micro avg 99.59 99.59 99.59 2226952
macro avg 98.03 96.68 97.34 2226952
weighted avg 99.58 99.59 99.58 2226952
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.98 99.99 591540
FULLSTOP (label_id: 1) 99.61 99.89 99.75 34333
-------------------
micro avg 99.97 99.97 99.97 625873
macro avg 99.80 99.93 99.87 625873
weighted avg 99.97 99.97 99.97 625873
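The aggregate rows in these reports are consistent with the usual definitions: the macro average is the unweighted mean over labels with non-zero support, and the weighted average weights each label by its support. This is an observation from the numbers, not a statement about the evaluation code. The sketch below reproduces the precision column of the English `punct_post` report above.

```python
from typing import List, Tuple

# (precision, support) for the labels with non-zero support in the English punct_post report
rows: List[Tuple[float, int]] = [
    (99.25, 564908),  # <NULL>
    (63.14, 613),     # <ACRONYM>
    (90.97, 32040),   # .
    (73.95, 24271),   # ,
    (79.05, 1041),    # ?
]

# Macro average: every label counts equally
macro = sum(p for p, _ in rows) / len(rows)
# Weighted average: every label counts in proportion to its support
total = sum(n for _, n in rows)
weighted = sum(p * n for p, n in rows) / total

print(f"macro precision:    {macro:.2f}")     # 81.27
print(f"weighted precision: {weighted:.2f}")  # 97.77
print(f"total support:      {total}")         # 622873
```

The gap between the two averages shows how strongly the `<NULL>` label dominates: rare labels such as `<ACRONYM>` pull the macro average down but barely affect the weighted one.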
Spanish
punct_pre test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.94 99.89 99.92 636941
¿ (label_id: 1) 56.73 71.35 63.20 1288
-------------------
micro avg 99.83 99.83 99.83 638229
macro avg 78.34 85.62 81.56 638229
weighted avg 99.85 99.83 99.84 638229
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.19 98.41 98.80 578271
<ACRONYM> (label_id: 1) 30.10 56.36 39.24 55
. (label_id: 2) 91.92 93.12 92.52 30856
, (label_id: 3) 72.98 82.44 77.42 27761
? (label_id: 4) 52.77 71.85 60.85 1286
? (label_id: 5) 0.00 0.00 0.00 0
, (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 97.40 97.40 97.40 638229
macro avg 69.39 80.44 73.77 638229
weighted avg 97.60 97.40 97.48 638229
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 99.82 99.86 99.84 2324724
UPPER (label_id: 1) 95.92 94.70 95.30 79266
-------------------
micro avg 99.69 99.69 99.69 2403990
macro avg 97.87 97.28 97.57 2403990
weighted avg 99.69 99.69 99.69 2403990
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.96 99.98 607057
FULLSTOP (label_id: 1) 99.31 99.88 99.60 34172
-------------------
micro avg 99.96 99.96 99.96 641229
macro avg 99.65 99.92 99.79 641229
weighted avg 99.96 99.96 99.96 641229
Amharic
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.83 99.28 99.56 729664
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
? (label_id: 5) 0.00 0.00 0.00 0
, (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 91.27 97.90 94.47 25341
፣ (label_id: 15) 61.93 82.11 70.60 5818
፧ (label_id: 16) 67.41 81.73 73.89 1177
-------------------
micro avg 99.08 99.08 99.08 762000
macro avg 80.11 90.26 84.63 762000
weighted avg 99.21 99.08 99.13 762000
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 98.40 98.03 98.21 1064
UPPER (label_id: 1) 71.23 75.36 73.24 69
-------------------
micro avg 96.65 96.65 96.65 1133
macro avg 84.81 86.69 85.73 1133
weighted avg 96.74 96.65 96.69 1133
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.85 99.92 743158
FULLSTOP (label_id: 1) 95.20 99.62 97.36 21842
-------------------
micro avg 99.85 99.85 99.85 765000
macro avg 97.59 99.74 98.64 765000
weighted avg 99.85 99.85 99.85 765000
Chinese
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.53 97.31 98.41 435611
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
? (label_id: 5) 81.85 87.31 84.49 1513
, (label_id: 6) 74.08 93.67 82.73 35921
。 (label_id: 7) 96.51 96.93 96.72 32097
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 97.00 97.00 97.00 505142
macro avg 87.99 93.81 90.59 505142
weighted avg 97.48 97.00 97.15 505142
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 94.89 94.98 94.94 2951
UPPER (label_id: 1) 81.34 81.03 81.18 796
-------------------
micro avg 92.02 92.02 92.02 3747
macro avg 88.11 88.01 88.06 3747
weighted avg 92.01 92.02 92.01 3747
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.97 99.98 473642
FULLSTOP (label_id: 1) 99.55 99.90 99.72 34500
-------------------
micro avg 99.96 99.96 99.96 508142
macro avg 99.77 99.93 99.85 508142
weighted avg 99.96 99.96 99.96 508142
Japanese
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.34 95.90 97.59 406341
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
? (label_id: 5) 70.55 73.56 72.02 1456
, (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 94.38 96.95 95.65 32537
、 (label_id: 8) 54.28 87.62 67.03 18610
・ (label_id: 9) 28.18 71.64 40.45 1100
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 95.51 95.51 95.51 460044
macro avg 69.35 85.13 74.55 460044
weighted avg 96.91 95.51 96.00 460044
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 92.33 94.03 93.18 4174
UPPER (label_id: 1) 83.51 79.46 81.43 1587
-------------------
micro avg 90.02 90.02 90.02 5761
macro avg 87.92 86.75 87.30 5761
weighted avg 89.90 90.02 89.94 5761
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.92 99.96 428544
FULLSTOP (label_id: 1) 99.07 99.87 99.47 34500
-------------------
micro avg 99.92 99.92 99.92 463044
macro avg 99.53 99.90 99.71 463044
weighted avg 99.92 99.92 99.92 463044
Hindi
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.75 99.44 99.59 560358
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0
. (label_id: 2) 0.00 0.00 0.00 0
, (label_id: 3) 69.55 78.48 73.75 8084
? (label_id: 4) 63.30 87.07 73.31 317
? (label_id: 5) 0.00 0.00 0.00 0
, (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 96.92 98.66 97.78 32118
؟ (label_id: 11) 0.00 0.00 0.00 0
، (label_id: 12) 0.00 0.00 0.00 0
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 99.11 99.11 99.11 600877
macro avg 82.38 90.91 86.11 600877
weighted avg 99.17 99.11 99.13 600877
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 97.19 96.72 96.95 2466
UPPER (label_id: 1) 89.14 90.60 89.86 734
-------------------
micro avg 95.31 95.31 95.31 3200
macro avg 93.17 93.66 93.41 3200
weighted avg 95.34 95.31 95.33 3200
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 100.00 99.99 99.99 569472
FULLSTOP (label_id: 1) 99.82 99.99 99.91 34405
-------------------
micro avg 99.99 99.99 99.99 603877
macro avg 99.91 99.99 99.95 603877
weighted avg 99.99 99.99 99.99 603877
Arabic
punct_post test report:
label precision recall f1 support
<NULL> (label_id: 0) 99.30 96.94 98.10 688043
<ACRONYM> (label_id: 1) 93.33 77.78 84.85 18
. (label_id: 2) 93.31 93.78 93.54 28175
, (label_id: 3) 0.00 0.00 0.00 0
? (label_id: 4) 0.00 0.00 0.00 0
? (label_id: 5) 0.00 0.00 0.00 0
, (label_id: 6) 0.00 0.00 0.00 0
。 (label_id: 7) 0.00 0.00 0.00 0
、 (label_id: 8) 0.00 0.00 0.00 0
・ (label_id: 9) 0.00 0.00 0.00 0
। (label_id: 10) 0.00 0.00 0.00 0
؟ (label_id: 11) 65.93 82.79 73.40 860
، (label_id: 12) 44.89 79.20 57.30 20941
; (label_id: 13) 0.00 0.00 0.00 0
። (label_id: 14) 0.00 0.00 0.00 0
፣ (label_id: 15) 0.00 0.00 0.00 0
፧ (label_id: 16) 0.00 0.00 0.00 0
-------------------
micro avg 96.29 96.29 96.29 738037
macro avg 79.35 86.10 81.44 738037
weighted avg 97.49 96.29 96.74 738037
cap test report:
label precision recall f1 support
LOWER (label_id: 0) 97.10 99.49 98.28 4137
UPPER (label_id: 1) 98.71 92.89 95.71 1729
-------------------
micro avg 97.55 97.55 97.55 5866
macro avg 97.90 96.19 96.99 5866
weighted avg 97.57 97.55 97.52 5866
seg test report:
label precision recall f1 support
NOSTOP (label_id: 0) 99.99 99.97 99.98 710456
FULLSTOP (label_id: 1) 99.39 99.85 99.62 30581
-------------------
micro avg 99.97 99.97 99.97 741037
macro avg 99.69 99.91 99.80 741037
weighted avg 99.97 99.97 99.97 741037
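Special Cases
Acronyms, Initialisms, and Bi-capitalized Words
This section briefly demonstrates the model's behavior when presented with the following:
- Acronyms: e.g., "NATO".
- Fake acronyms: e.g., "NHTG" in place of "NATO".
- Ambiguous terms that could be an acronym or a proper noun: e.g., "Tuny".
- Bi-capitalized words: e.g., "McDavid".
- Initialisms: e.g., "p.m.".
Example inputs for acronyms, etc.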
from typing import List

from punctuators.models import PunctCapSegModelONNX

m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained(
    "1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase"
)

input_texts = [
    "the us is a nato member as a nato member the country enjoys security guarantees notably article 5",
    "the us is a nhtg member as a nhtg member the country enjoys security guarantees notably article 5",
    "the us is a tuny member as a tuny member the country enjoys security guarantees notably article 5",
    "connor andrew mcdavid is a canadian professional ice hockey centre and captain of the edmonton oilers of the national hockey league the oilers selected him first overall in the 2015 nhl entry draft mcdavid spent his childhood playing ice hockey against older children",
    "please rsvp for the party asap preferably before 8 pm tonight",
]

results: List[List[str]] = m.infer(
    texts=input_texts, apply_sbd=True,
)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
Expected output
Input: the us is a nato member as a nato member the country enjoys security guarantees notably article 5
Outputs:
The U.S. is a NATO member.
As a NATO member, the country enjoys security guarantees, notably Article 5.
Input: the us is a nhtg member as a nhtg member the country enjoys security guarantees notably article 5
Outputs:
The U.S. is a NHTG member.
As a NHTG member, the country enjoys security guarantees, notably Article 5.
Input: the us is a tuny member as a tuny member the country enjoys security guarantees notably article 5
Outputs:
The U.S. is a Tuny member.
As a Tuny member, the country enjoys security guarantees, notably Article 5.
Input: connor andrew mcdavid is a canadian professional ice hockey centre and captain of the edmonton oilers of the national hockey league the oilers selected him first overall in the 2015 nhl entry draft mcdavid spent his childhood playing ice hockey against older children
Outputs:
Connor Andrew McDavid is a Canadian professional ice hockey centre and captain of the Edmonton Oilers of the National Hockey League.
The Oilers selected him first overall in the 2015 NHL entry draft.
McDavid spent his childhood playing ice hockey against older children.
Input: please rsvp for the party asap preferably before 8 pm tonight
Outputs:
Please RSVP for the party ASAP, preferably before 8 p.m. tonight.
📄 License
This model is released under the Apache 2.0 license.








