longformer-base-plagiarism-detection: a free, open-source model for detecting machine-paraphrased plagiarism and protecting academic integrity

Longformer Base Plagiarism Detection

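Developed by jpwahle

This model is built on the Longformer architecture and is trained specifically to detect machine-paraphrased plagiarism, which makes it useful for safeguarding academic integrity.

Text Classification

Transformers

English · #AcademicPlagiarismDetection #LongTextAnalysis #MachineParaphraseDetection

Downloads: 59.47k

Released: 3/2/2022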

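Model Overview

A plagiarism detection system fine-tuned from the pretrained Longformer-base-4096 model. It identifies academic text that has been rewritten with tools such as SpinBot, and the cited paper reports an average F1 score of 80.99%.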
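To make the F1 figures on this page easier to read, here is a minimal sketch of how an F1 score is computed for a binary "plagiarised" class. The function name and the confusion counts are hypothetical and chosen only for illustration. They are not results from the paper, and the sketch does not reproduce how the paper averages F1 across tool configurations.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the "plagiarised" class (illustrative only).
score = f1_score(tp=80, fp=10, fn=30)
print(f"F1 = {score:.4f}")
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why it is a common choice for detection tasks like this one.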

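Model Features

Long-document processing

Uses a sliding-window attention mechanism to process academic documents of up to 4,096 tokens efficiently.

Detection across paraphrasing tools

Trained to detect text rewritten by the paraphrasing tools SpinBot and SpinnerChief (F1 = 99.68% and 71.64%, respectively, according to the cited paper).

Tuned for academic text

Performs strongly on academic text such as research-paper preprints and graduation theses (F1 up to 99.68%, reached on SpinBot cases).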
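The sliding-window attention mentioned above can be illustrated with a toy sketch. Each token attends only to neighbours inside a fixed window, so the number of attended pairs grows roughly linearly with sequence length instead of quadratically. The function name and the toy sizes below are assumptions for illustration. The real Longformer implementation also adds global attention on selected tokens and uses much larger windows, neither of which is shown here.

```python
from typing import List, Tuple


def sliding_window_pairs(seq_len: int, window: int) -> List[Tuple[int, int]]:
    """List the (query, key) index pairs allowed by a symmetric local window."""
    half = window // 2
    pairs = []
    for q in range(seq_len):
        lo = max(0, q - half)            # clip the window at the sequence start
        hi = min(seq_len, q + half + 1)  # clip the window at the sequence end
        pairs.extend((q, k) for k in range(lo, hi))
    return pairs


# Toy sizes, illustrative only.
seq_len, window = 8, 4
local = sliding_window_pairs(seq_len, window)
print(len(local), "local pairs vs", seq_len * seq_len, "pairs for full attention")
```

With a fixed window, doubling the sequence length roughly doubles the number of pairs, whereas full self-attention would quadruple it.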

Model Capabilities

Machine-paraphrased text identification

Academic plagiarism detection

Long-text semantic analysis

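Use Cases

Academic integrity

Paper plagiarism detection

Identifies plagiarized content in student papers that has been disguised with paraphrasing tools.

F1 = 99.68% on text paraphrased with SpinBot.

Publication review support

Helps journal editors screen submitted manuscripts for potential plagiarism.

According to the cited paper, automated classification alleviates shortcomings of widely used text-matching systems such as Turnitin and PlagScan.

Education quality assurance

Assignment originality checks

Automatically screens student assignments for machine-paraphrased content.

For comparison, human evaluators achieved F1 = 78.4% on SpinBot cases and F1 = 65.6% on SpinnerChief cases.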

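🚀 Longformer-base Model for Machine-Paraphrase Detection

This model detects machine-paraphrased plagiarism. It starts from the pretrained Longformer-base model and is fine-tuned on a machine-paraphrase dataset to support academic integrity and improve the accuracy of plagiarism detection.

🚀 Quick Start

Loading and using the model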

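from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model_id = "jpelhaw/longformer-base-plagiarism-detection"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Plagiarism is the representation of another author's writing, \
thoughts, ideas, or expressions as one's own work."

# Call the tokenizer itself (not tokenizer.tokenize) to get model-ready tensors.
example = tokenizer(text, return_tensors="pt", truncation=True)

answer = model(**example)

# Map the highest-scoring logit to the label name stored in the model config.
label = model.config.id2label[answer.logits.argmax(dim=-1).item()]

# The original example annotates this input as "plagiarised".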
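Documents longer than the 4,096-token limit have to be shortened or split before classification. The following is a minimal sketch of one common approach, splitting a list of token IDs into overlapping windows. The helper name, the overlap size, and the strategy itself are assumptions and are not described in the model card. In practice `max_len` would also need to leave room for the special tokens added to each window, and the per-window predictions would still have to be combined by some rule of your choosing.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 4096,
                       overlap: int = 256) -> List[List[int]]:
    """Split token IDs into windows of at most max_len with a fixed overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the document
        start += step
    return windows


# Toy example with small sizes (illustrative only).
chunks = split_into_windows(list(range(10)), max_len=4, overlap=1)
print(chunks)
```

The overlap keeps sentences that straddle a window boundary visible in full to at least one window, at the cost of classifying some tokens twice.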

📚 Detailed Documentation

Citation

If you use this model in your research, please cite the following paper:

@InProceedings{10.1007/978-3-030-96957-8_34,
    author="Wahle, Jan Philip and Ruas, Terry and Folt{\'y}nek, Tom{\'a}{\v{s}} and Meuschke, Norman and Gipp, Bela",
    title="Identifying Machine-Paraphrased Plagiarism",
    booktitle="Information for a Better World: Shaping the Global Future",
    year="2022",
    publisher="Springer International Publishing",
    address="Cham",
    pages="393--413",
    abstract="Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99{\%} (F1=99.68{\%} for SpinBot and F1=71.64{\%} for SpinnerChief cases), while human evaluators achieved F1=78.4{\%} for SpinBot and F1=65.6{\%} for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan.",
    isbn="978-3-030-96957-8"
}

Additional Information

This model is a checkpoint of Longformer-base after training on the Machine-Paraphrased Plagiarism dataset.

📄 License

The documentation does not mention a license.

🔍 Other Information

Dataset: jpwahle/machine-paraphrase-dataset
Widget example text: Plagiarism is the representation of another author's writing, thoughts, ideas, or expressions as one's own work.