longformer-base-plagiarism-detection open-source model - free detection of machine-paraphrased plagiarism to support academic integrity

Longformer Base Plagiarism Detection

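Developed by jpwahle

This model is trained on the Longformer architecture specifically to detect machine-paraphrased plagiarism, which makes it useful for safeguarding academic integrity.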

Text Classification

Transformers

English #AcademicPlagiarismDetection #LongTextAnalysis #MachineParaphraseDetection

Downloads: 59.47k

Released: 3/2/2022

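Model Overview

A plagiarism detection system fine-tuned from the pretrained Longformer-base-4096 model. It identifies academic text rewritten with tools such as SpinBot and reaches an average F1 score of 80.99%.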

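Model Features

Long-document processing

Uses a sliding-window attention mechanism to handle academic documents of up to 4096 tokens.

Detection across paraphrasing tools

Targets text rewritten with the paraphrasing tools SpinBot and SpinnerChief (F1 = 99.68% for SpinBot cases and F1 = 71.64% for SpinnerChief cases).

Academic focus

Performs strongly on academic text such as research-paper preprints and graduation theses (F1 of up to 99.68%).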
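
The sliding-window attention mentioned under "Long-document processing" can be illustrated with a toy mask. This is a minimal sketch for intuition only, not the model's actual implementation: the function name, the sequence length of 8, and the window of 4 are all chosen for illustration, and the sketch leaves out the global attention that Longformer adds on selected tokens.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Each token sees at most `window // 2` neighbours on either side,
    so the number of allowed pairs grows linearly with seq_len
    instead of quadratically.
    """
    half = window // 2
    return [
        [abs(i - j) <= half for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes for readability; the real model works with up to 4096 tokens.
mask = sliding_window_mask(seq_len=8, window=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Compare the number of attended pairs with full (all-to-all) attention.
local_pairs = sum(sum(row) for row in mask)
full_pairs = 8 * 8
print(local_pairs, "of", full_pairs, "pairs attended")
```

The printed band along the diagonal shows why this pattern scales to long documents: each row has a fixed maximum number of attended positions regardless of sequence length.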

Model Capabilities

Machine-paraphrased text identification

Academic plagiarism detection

Long-text semantic analysis

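Use Cases

Academic integrity

Paper plagiarism detection

Identify plagiarized content in student papers that has been disguised with paraphrasing tools.

F1 of 99.68% on SpinBot-paraphrased text.

Publication review support

Help journal editors detect potential plagiarism in submitted manuscripts.

Alleviates shortcomings of widely used text-matching systems such as Turnitin and PlagScan.

Educational quality assurance

Assignment originality checks

Automatically screen student assignments for machine-paraphrased content.

For comparison, human evaluators achieved F1 = 78.4% on SpinBot cases.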
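
The figures quoted in this section are F1 scores, that is, the harmonic mean of precision and recall. The sketch below shows how such a score is computed from confusion counts. The counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was detected correctly
    precision = tp / (tp + fp)  # share of flagged texts that were truly paraphrased
    recall = tp / (tp + fn)     # share of paraphrased texts that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
score = f1_score(tp=90, fp=10, fn=30)
print(f"F1 = {score:.2%}")
```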

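🚀 Longformer-base model for machine-paraphrase detection

This model detects machine-paraphrased plagiarism. It starts from the pretrained Longformer-base model and is trained on a machine-paraphrase dataset, with the aim of supporting academic integrity and improving plagiarism-detection accuracy.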

🚀 Quick Start

Loading and using the model

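from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its matching tokenizer
model = AutoModelForSequenceClassification.from_pretrained("jpelhaw/longformer-base-plagiarism-detection")
tokenizer = AutoTokenizer.from_pretrained("jpelhaw/longformer-base-plagiarism-detection")

text = "Plagiarism is the representation of another author's writing, \
thoughts, ideas, or expressions as one's own work."

# Call the tokenizer itself (not .tokenize) to get input_ids and attention_mask tensors
example = tokenizer(text, truncation=True, return_tensors="pt")

answer = model(**example)  # answer.logits holds one score per class

# Expected label according to the model card: "plagiarised"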
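
The forward pass returns raw logits rather than a label string, so one more step is needed to get a readable prediction. The sketch below shows one way to do that with a softmax and an argmax. The helper name `logits_to_label`, the label mapping, and the logit values are hypothetical. With the real model, the mapping would presumably come from `model.config.id2label` and the scores from `answer.logits[0].tolist()`.

```python
import math
from typing import Dict, List, Tuple


def logits_to_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Softmax the logits and return the most likely label with its probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical mapping and logits, for illustration only.
ID2LABEL = {0: "original", 1: "plagiarised"}
label, prob = logits_to_label([-1.2, 2.3], ID2LABEL)
print(label, round(prob, 3))
```

Returning the probability alongside the label lets a caller apply a stricter threshold before flagging a text, which may matter when a false accusation is costly.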

📚 Documentation

Citation

If you use this model in your research, please cite the following paper:

@InProceedings{10.1007/978-3-030-96957-8_34,
    author="Wahle, Jan Philip and Ruas, Terry and Folt{\'y}nek, Tom{\'a}{\v{s}} and Meuschke, Norman and Gipp, Bela",
    title="Identifying Machine-Paraphrased Plagiarism",
    booktitle="Information for a Better World: Shaping the Global Future",
    year="2022",
    publisher="Springer International Publishing",
    address="Cham",
    pages="393--413",
    abstract="Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99{\%} (F1=99.68{\%} for SpinBot and F1=71.64{\%} for SpinnerChief cases), while human evaluators achieved F1=78.4{\%} for SpinBot and F1=65.6{\%} for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan.",
    isbn="978-3-030-96957-8"
}

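Additional Information

This model is a checkpoint of Longformer-base after training on the machine-paraphrased plagiarism dataset.
More information about this model: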

📄 License

The documentation does not specify a license.

🔍 Other Information

Dataset: jpwahle/machine-paraphrase-dataset
Widget example text: Plagiarism is the representation of another author's writing, thoughts, ideas, or expressions as one's own work.