🚀 拼寫錯誤檢測器
拼寫錯誤檢測器是一個用於檢測文本中拼寫錯誤的模型。它使用特定的數據集進行訓練,並通過評估指標展示了良好的性能。用戶可以方便地使用該模型對文本進行拼寫錯誤檢測。
🚀 快速開始
安裝依賴
pip install transformers
使用pipeline進行預測
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
model_name_or_path = "m3hrdadfi/typo-detector-distilbert-en"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForTokenClassification.from_pretrained(model_name_or_path, config=config)
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")
sentences = [
"He had also stgruggled with addiction during his time in Congress .",
"The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
"Letterma also apologized two his staff for the satyation .",
"Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
"It is left to the directors to figure out hpw to bring the stry across to tye audience .",
]
for sentence in sentences:
typos = [sentence[r["start"]: r["end"]] for r in nlp(sentence)]
detected = sentence
for typo in typos:
detected = detected.replace(typo, f'<i>{typo}</i>')
print(" [Input]: ", sentence)
print("[Detected]: ", detected)
print("-" * 130)
輸出:
[Input]: He had also stgruggled with addiction during his time in Congress .
[Detected]: He had also <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]: The review <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG <i>esign</i> <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: Letterma also apologized two his staff for the satyation .
[Detected]: <i>Letterma</i> also apologized <i>two</i> his staff for the <i>satyation</i> .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .
[Detected]: Vincent Jay had earlier won France 's first gold in <i>gthe</i> 10km biathlon sprint .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: It is left to the directors to figure out hpw to bring the stry across to tye audience .
[Detected]: It is left to the directors to figure out <i>hpw</i> to bring the <i>stry</i> across to <i>tye</i> audience .
----------------------------------------------------------------------------------------------------------------------------------
📚 詳細文檔
數據集信息
針對此特定任務,我使用了 NeuSpell 語料庫作為原始數據。
評估
以下表格總結了模型整體以及每個類別的得分。
類別 |
精確率 |
召回率 |
F1分數 |
樣本數 |
拼寫錯誤 |
0.992332 |
0.985997 |
0.989154 |
416054.0 |
微平均 |
0.992332 |
0.985997 |
0.989154 |
416054.0 |
宏平均 |
0.992332 |
0.985997 |
0.989154 |
416054.0 |
加權平均 |
0.992332 |
0.985997 |
0.989154 |
416054.0 |
❓ 常見問題
若有疑問,請在 TypoDetector Issues 倉庫中提交GitHub問題。