typo-detector-distilbert-en: Open-Source Typo Detection Model

Typo Detector Distilbert En

Developed by m3hrdadfi

A typo detection model built on the DistilBERT architecture that identifies misspelled words in text

Token classification

Transformers

English #typo-detection #high-F1 #English-text-processing

Downloads: 25.05k

Released: 3/2/2022

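Model Overview

This is a DistilBERT-based token-classification model, set up in the style of named entity recognition (NER), that flags misspelled words in English text. It was trained on the NeuSpell corpus and reaches an F1 score of about 0.989 on the typo class.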

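Model Features

High accuracy

The model reaches an F1 score of 0.989 on the typo detection task

Based on DistilBERT

Uses the lightweight DistilBERT architecture, which reduces compute requirements while preserving performance

Easy to use

Can be integrated into applications through the Transformers pipeline API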

Model Capabilities

Typo detection in text

NER-style token classification

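Use Cases

Text editing and proofreading

Document proofreading

Automatically detect spelling errors in documents

Improve document quality and professionalism

Content moderation

Identify spelling problems in user-generated content

Raise the quality of content on a platform

Education

Language learning support

Help language learners spot spelling errors in their writing

Improve learning efficiency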

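🚀 Typo Detector

Typo Detector is a model that detects spelling errors in text. It was trained on the NeuSpell corpus and reaches an F1 score of 0.989 on typo detection. It can be applied to text through the Transformers token-classification pipeline.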

🚀 Quick Start

Install dependencies

pip install transformers torch

Prediction with a pipeline

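import torch  # PyTorch backend; importing it here makes a missing install fail early
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline


# Load the configuration, tokenizer, and fine-tuned weights from the Hugging Face Hub.
model_name_or_path = "m3hrdadfi/typo-detector-distilbert-en"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForTokenClassification.from_pretrained(model_name_or_path, config=config)
# aggregation_strategy="average" groups word pieces into whole words and averages their scores.
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")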

sentences = [
 "He had also stgruggled with addiction during his time in Congress .",
 "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
 "Letterma also apologized two his staff for the satyation .",
 "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
 "It is left to the directors to figure out hpw to bring the stry across to tye audience .",
]

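for sentence in sentences:
    # Each prediction carries character offsets into the original sentence.
    spans = sorted((r["start"], r["end"]) for r in nlp(sentence))

    # Wrap each flagged span by offset rather than with str.replace,
    # which would also mark identical substrings elsewhere in the sentence.
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(sentence[cursor:start])
        pieces.append(f'<i>{sentence[start:end]}</i>')
        cursor = end
    pieces.append(sentence[cursor:])
    detected = "".join(pieces)

    print("   [Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)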

Output:

   [Input]:  He had also stgruggled with addiction during his time in Congress .
[Detected]:  He had also <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]:  The review <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG <i>esign</i> <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Letterma also apologized two his staff for the satyation .
[Detected]:  <i>Letterma</i> also apologized <i>two</i> his staff for the <i>satyation</i> .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .
[Detected]:  Vincent Jay had earlier won France 's first gold in <i>gthe</i> 10km biathlon sprint .
----------------------------------------------------------------------------------------------------------------------------------
   [Input]:  It is left to the directors to figure out hpw to bring the stry across to tye audience .
[Detected]:  It is left to the directors to figure out <i>hpw</i> to bring the <i>stry</i> across to <i>tye</i> audience .
----------------------------------------------------------------------------------------------------------------------------------
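The character offsets returned by the pipeline can also be used to drop low-confidence detections before marking them. The sketch below assumes each prediction is a dict with `start`, `end`, and `score` keys, which is what the token-classification pipeline returns when an aggregation strategy is set. The helper name `mark_typos`, the 0.5 threshold, and the hand-written scores are illustrative assumptions and not part of the model card.

```python
from typing import Dict, List


def mark_typos(sentence: str, predictions: List[Dict], min_score: float = 0.5) -> str:
    """Wrap each sufficiently confident predicted span in <i>...</i> tags."""
    # Keep only spans at or above the (hypothetical) confidence threshold.
    spans = sorted(
        (p["start"], p["end"]) for p in predictions if p["score"] >= min_score
    )
    pieces, cursor = [], 0
    for start, end in spans:
        if start < cursor:  # skip a span that overlaps the previous one
            continue
        pieces.append(sentence[cursor:start])
        pieces.append(f"<i>{sentence[start:end]}</i>")
        cursor = end
    pieces.append(sentence[cursor:])
    return "".join(pieces)


# Hand-written predictions standing in for nlp(sentence); the scores are made up.
sentence = "Letterma also apologized two his staff for the satyation ."
fake_predictions = [
    {"start": 0, "end": 8, "score": 0.99},
    {"start": 25, "end": 28, "score": 0.40},
    {"start": 47, "end": 56, "score": 0.97},
]
print(mark_typos(sentence, fake_predictions))
```

With the real model, `fake_predictions` would be replaced by `nlp(sentence)`. A suitable threshold depends on how costly false alarms are in the application.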

📚 Documentation

Dataset

For this task, I used the NeuSpell corpus as the raw data.
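The card does not describe how the raw data was turned into token labels. One plausible preparation, shown here purely as an assumption, is to compare each noised sentence with its clean counterpart word by word and tag the words that differ. The function name `label_typos` and the tag names `TYPO` and `O` are hypothetical.

```python
from typing import List, Tuple


def label_typos(clean: str, noisy: str) -> List[Tuple[str, str]]:
    """Tag each word of `noisy` as TYPO if it differs from the aligned clean word."""
    clean_words, noisy_words = clean.split(), noisy.split()
    # This simple alignment only works when both sentences have the same word count.
    if len(clean_words) != len(noisy_words):
        raise ValueError("sentences must have the same number of words")
    return [
        (noisy_word, "O" if clean_word == noisy_word else "TYPO")
        for clean_word, noisy_word in zip(clean_words, noisy_words)
    ]


pairs = label_typos(
    "He had also struggled with addiction .",
    "He had also stgruggled with addiction .",
)
print(pairs)
```

Word-level labels like these would still need to be mapped onto subword tokens before fine-tuning a DistilBERT token classifier.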

Evaluation

The following table summarizes the scores for the model overall and for each class.

Class	Precision	Recall	F1 score	Support
Typo	0.992332	0.985997	0.989154	416054
Micro avg	0.992332	0.985997	0.989154	416054
Macro avg	0.992332	0.985997	0.989154	416054
Weighted avg	0.992332	0.985997	0.989154	416054
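The F1 column is the harmonic mean of precision and recall, and because the table has only one labelled class, the three averaged rows coincide with the per-class row. A minimal check of the reported figure, using a helper name (`f1_score`) chosen here for illustration:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the evaluation table.
f1 = f1_score(0.992332, 0.985997)
print(round(f1, 6))
```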