🚀 mBERT Model for Detecting English Borrowings in Spanish
This is a pretrained model for detecting unassimilated English lexical borrowings (i.e., anglicisms) in Spanish newswire. The model labels words of foreign origin used in Spanish (mostly from English), such as fake news, machine learning, smartwatch, influencer, or streaming.
The model is a fine-tuned version of multilingual BERT, trained on the COALAS corpus for the task of lexical borrowing detection.
The model considers two labels:
- ENG: for English lexical borrowings (e.g., smartphone, online, podcast)
- OTHER: for lexical borrowings from other languages (e.g., boutique, anime, umami)
The model uses BIO encoding to handle multi-token borrowings.
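To illustrate the BIO scheme, here is a hand-written sketch of how a multi-token borrowing is tagged (an assumed example for illustration, not actual model output):

# Hand-written illustration of BIO tagging (not actual model output):
# B-ENG opens an English borrowing, I-ENG continues it, O marks all other tokens.
tokens = ["Buscamos", "data", "scientist", "para", "proyecto", "de", "machine", "learning", "."]
tags   = ["O", "B-ENG", "I-ENG", "O", "O", "O", "B-ENG", "I-ENG", "O"]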
⚠️ Important Note
This is not the best-performing model for this task. For the best-performing model (F1 = 85.76), see the Flair model.
✨ Key Features
- Detects unassimilated English lexical borrowings in Spanish newswire.
- Labels borrowings by their source language (English vs. other).
- Uses BIO encoding to handle multi-token borrowings (see the sketch above).
📦 Installation
The original documentation does not provide installation steps, so this section is omitted.
💻 Usage Examples
Basic usage
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned mBERT tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-mbert")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-mbert")

# Borrowing detection runs as a token-classification ("ner") pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Buscamos data scientist para proyecto de machine learning."
borrowings = nlp(example)
print(borrowings)
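The pipeline returns one dict per subword token, with fields such as entity, score, word, start, and end. As a minimal sketch (assuming a transformers version that supports the aggregation_strategy parameter), you can also let the pipeline merge B-/I- pieces into whole spans:

# Group B-ENG/I-ENG subword predictions into whole borrowing spans
# (assumes a transformers version with aggregation_strategy support)
nlp_grouped = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(nlp_grouped(example))  # e.g. spans such as "data scientist" and "machine learning"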
📚 Documentation
Evaluation metrics (test set)
The table below summarizes the results obtained on the test set of the COALAS corpus.
| Label | Precision | Recall | F1    |
|-------|-----------|--------|-------|
| ALL   | 88.09     | 79.46  | 83.55 |
| ENG   | 88.44     | 82.16  | 85.19 |
| OTHER | 37.5      | 6.52   | 11.11 |
Dataset
The model was trained on COALAS, a corpus of Spanish newswire annotated for unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a variety of written media in European Spanish. The test set was designed to be as challenging as possible: it covers sources and dates not seen in the training set, it contains a high proportion of OOV words (92% of the borrowings in the test set are OOV), and it is very borrowing-dense (20 borrowings per 1,000 tokens).
| Set         | Tokens  | ENG borrowings | OTHER borrowings | Unique borrowings |
|-------------|---------|----------------|------------------|-------------------|
| Training    | 231,126 | 1,493          | 28               | 380               |
| Development | 82,578  | 306            | 49               | 316               |
| Test        | 58,997  | 1,239          | 46               | 987               |
| Total       | 372,701 | 3,038          | 123              | 1,683             |
More information
For further information on the dataset, modeling experiments, and error analysis, see the paper Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling.
🔧 Technical Details
The original documentation does not provide implementation details, so this section is omitted.
📄 License
This model is released under a CC BY 4.0 license.
📚 Citation
If you use this model, please cite the following reference:
@inproceedings{alvarez-mellado-lignos-2022-detecting,
title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
author = "{\'A}lvarez-Mellado, Elena and
Lignos, Constantine",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.268",
pages = "3868--3888",
abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}