🚀 mBERT Model for Detecting English Borrowings in Spanish
This is a pretrained model for detecting unassimilated English lexical borrowings (i.e., anglicisms) in Spanish newswire. The model labels words of foreign origin used in Spanish (mostly from English), such as fake news, machine learning, smartwatch, influencer, or streaming.
The model is a fine-tuned version of multilingual BERT, trained on the COALAS corpus for the task of lexical borrowing detection.
The model considers two labels:
- ENG: for English lexical borrowings (e.g., smartphone, online, podcast)
- OTHER: for lexical borrowings from other languages (e.g., boutique, anime, umami)
The model uses BIO encoding to handle multi-token borrowings.
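To illustrate the BIO scheme, here is a hand-written sketch of how a multi-token borrowing is tagged (an assumed example for illustration, not actual model output):

# Hand-written illustration of BIO tagging (not actual model output):
# B-ENG opens an English borrowing, I-ENG continues it, O marks all other tokens.
tokens = ["Buscamos", "data", "scientist", "para", "proyecto", "de", "machine", "learning", "."]
tags   = ["O", "B-ENG", "I-ENG", "O", "O", "O", "B-ENG", "I-ENG", "O"]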
⚠️ Important Note
This is not the best-performing model for this task. For the best-performing model (F1 = 85.76), see the Flair model.
✨ Key Features
- Detects unassimilated English lexical borrowings in Spanish newswire.
- Labels borrowings by their source language (English vs. other).
- Uses BIO encoding to handle multi-token borrowings (see the sketch above).
📦 Installation
The original documentation does not provide installation steps, so this section is omitted.
💻 Usage Examples
Basic usage
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned mBERT tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-mbert")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-mbert")

# Borrowing detection runs as a token-classification ("ner") pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Buscamos data scientist para proyecto de machine learning."
borrowings = nlp(example)
print(borrowings)
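The pipeline returns one dict per subword token, with fields such as entity, score, word, start, and end. As a minimal sketch (assuming a transformers version that supports the aggregation_strategy parameter), you can also let the pipeline merge B-/I- pieces into whole spans:

# Group B-ENG/I-ENG subword predictions into whole borrowing spans
# (assumes a transformers version with aggregation_strategy support)
nlp_grouped = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(nlp_grouped(example))  # e.g. spans such as "data scientist" and "machine learning"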
📚 Documentation
Evaluation metrics (test set)
The table below summarizes the results obtained on the test set of the COALAS corpus.
| Label | Precision | Recall | F1    |
|-------|-----------|--------|-------|
| ALL   | 88.09     | 79.46  | 83.55 |
| ENG   | 88.44     | 82.16  | 85.19 |
| OTHER | 37.5      | 6.52   | 11.11 |
Dataset
The model was trained on COALAS, a corpus of Spanish newswire annotated for unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a variety of written media in European Spanish. The test set was designed to be as challenging as possible: it covers sources and dates not seen in the training set, it contains a high proportion of OOV words (92% of the borrowings in the test set are OOV), and it is very borrowing-dense (20 borrowings per 1,000 tokens).
| Set         | Tokens  | ENG borrowings | OTHER borrowings | Unique borrowings |
|-------------|---------|----------------|------------------|-------------------|
| Training    | 231,126 | 1,493          | 28               | 380               |
| Development | 82,578  | 306            | 49               | 316               |
| Test        | 58,997  | 1,239          | 46               | 987               |
| Total       | 372,701 | 3,038          | 123              | 1,683             |
More information
For further information on the dataset, modeling experiments, and error analysis, see the paper Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling.
🔧 Technical Details
The original documentation does not provide implementation details, so this section is omitted.
📄 License
This model is released under a CC BY 4.0 license.
📚 Citation
If you use this model, please cite the following reference:
@inproceedings{alvarez-mellado-lignos-2022-detecting,
title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
author = "{\'A}lvarez-Mellado, Elena and
Lignos, Constantine",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.268",
pages = "3868--3888",
abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}