anglicisms-spanish-flair-cs: open-source pretrained model for detecting English borrowings in Spanish news

Anglicisms Spanish Flair Cs

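Developed by lirondos

A pretrained model for detecting unassimilated English lexical borrowings in Spanish news text. It identifies foreign terms such as 'fake news' and 'machine learning'.

Sequence labeling

PyTorch

Spanish #SpanishBorrowingDetection #CodeSwitchingDetection #NewsTextAnalysis

Downloads: 8,115

Released: 3/29/2022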

Model overview

This is a BiLSTM-CRF model built to detect foreign words used in Spanish (mainly from English), such as *fake news* and *machine learning*.

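Model features

Multilingual borrowing detection

Identifies unassimilated English lexical borrowings in Spanish (ENG label) as well as borrowings from other languages (OTHER label).

Pretrained on code-switched data

The model's input includes Transformer-based embeddings pretrained on code-switched data, which helps it handle mixed-language text.

Challenging test set

The test set is designed to be highly challenging: it covers sources and dates not seen in the training set and contains many out-of-vocabulary (OOV) words (92% of its borrowings are OOV).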
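The OOV figure can be illustrated with a minimal sketch. The function name `oov_rate` and the toy word lists are hypothetical, and the source does not say exactly how its 92% was computed (for example, whether case was ignored or whether tokens or types were counted), so the case-insensitive, per-occurrence comparison below is an assumption.

```python
def oov_rate(train_borrowings, test_borrowings):
    """Percentage of test borrowings that never occur among the training borrowings.

    Assumption: matching is case-insensitive and counted per occurrence.
    """
    if not test_borrowings:
        raise ValueError("test_borrowings must not be empty")
    seen = {b.lower() for b in train_borrowings}
    unseen = [b for b in test_borrowings if b.lower() not in seen]
    return 100 * len(unseen) / len(test_borrowings)


# Toy, made-up lists (not taken from the corpus).
toy_train = ["online", "podcast", "streaming"]
toy_test = ["fake news", "online", "prime time", "influencer"]
toy_rate = oov_rate(toy_train, toy_test)
print(f"OOV borrowings in toy test set: {toy_rate:.1f}%")
```

A high value means the model cannot rely on having memorized borrowings seen during training and has to generalize from form and context.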

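Model capabilities

Identifies English borrowings in Spanish

Identifies borrowings from other languages in Spanish

Detects multiword borrowings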

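Use cases

News media analysis

Detecting English borrowings in news

Analyze the English terms used in Spanish news, such as 'fake news' and 'prime time'.

Precision 90.16%, recall 84.34%, F1 87.16% (ENG label, COALAS test set)

Linguistic research

Lexical borrowing research

Supports research on the distribution and trends of unassimilated lexical borrowings in Spanish.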

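🚀 Pretrained model for detecting English borrowings in Spanish

This is a pretrained model for detecting unassimilated English lexical borrowings (anglicisms) in Spanish newswire. It labels foreign words used in Spanish (mainly from English), such as fake news, machine learning, smartwatch, influencer, or streaming.

🚀 Quick start

The model is a BiLSTM-CRF fed with Transformer-based embeddings pretrained on code-switched data, together with subword embeddings (BPE and character embeddings). It was trained on the COALAS corpus to detect lexical borrowings.

Model labels

The model considers two labels:

ENG: English lexical borrowings (such as smartphone, online, podcast)
OTHER: lexical borrowings from other languages (such as boutique, anime, umami)

The model uses BIO encoding to account for multiword borrowings.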
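As a rough illustration of what BIO encoding means here, the sketch below collapses token-level tags (B- opens a borrowing, I- continues it, O is outside any borrowing) into labeled spans. The helper `bio_to_spans` is hypothetical and the tags are hand-written for illustration, not actual model output.

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags into (span text, label) pairs."""
    spans = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        # A span starts at a B- tag, or at an I- tag that does not continue
        # the currently open label (a stray I- is treated as a span start).
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if tag == "O" or starts_new:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
        if tag != "O":
            if not current:
                label = tag[2:]
            current.append(token)
    if current:  # flush a span that runs to the end of the sentence
        spans.append((" ".join(current), label))
    return spans


tokens = "Las fake news sobre la celebrity se reprodujeron por los mass media en prime time .".split()
# Hand-written illustrative tags, one per token.
tags = ["O", "B-ENG", "I-ENG", "O", "O", "B-ENG", "O", "O",
        "O", "O", "B-ENG", "I-ENG", "O", "B-ENG", "I-ENG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

With this scheme a multiword borrowing such as "fake news" is recovered as a single ENG span rather than as two separate one-word borrowings.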

⚠ There is another mBERT-based model for the same task, trained with the Transformers library. That model, however, scored lower (F1 = 83.55) than this Flair-based model.

✨ Key features

Evaluation metrics (test set)

Results obtained on the test set of the COALAS corpus:

Label	Precision	Recall	F1
ALL	90.14	81.79	85.76
ENG	90.16	84.34	87.16
OTHER	85.71	13.04	22.64
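The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the table; the function name `f1_score` is only for illustration, and because the inputs are already rounded to two decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "ALL": (90.14, 81.79, 85.76),
    "ENG": (90.16, 84.34, 87.16),
    "OTHER": (85.71, 13.04, 22.64),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: recomputed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```

The OTHER row shows why F1 is informative here: precision is high, but the low recall pulls the harmonic mean down sharply.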

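Dataset

The model was trained on COALAS, a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a variety of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, includes a high number of OOV words (92% of the borrowings in the test set are OOV), and is very borrowing-dense (20 borrowings per 1,000 tokens).

Split	Tokens	ENG borrowings	OTHER borrowings	Unique borrowings
Training	231,126	1,493	28	380
Development	82,578	306	49	316
Test	58,997	1,239	46	987
Total	372,701	3,038	123	1,683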
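The borrowing density mentioned above can be recomputed from the table. This is a small sketch using only the counts listed there; the dictionary layout and the function name `borrowings_per_thousand` are illustrative choices, and density is assumed to mean ENG plus OTHER borrowings per 1,000 tokens.

```python
# Counts copied from the dataset table.
splits = {
    "training": {"tokens": 231_126, "eng": 1_493, "other": 28},
    "development": {"tokens": 82_578, "eng": 306, "other": 49},
    "test": {"tokens": 58_997, "eng": 1_239, "other": 46},
}


def borrowings_per_thousand(split):
    """Borrowings (ENG + OTHER) per 1,000 tokens in one split."""
    return 1000 * (split["eng"] + split["other"]) / split["tokens"]


for name, split in splits.items():
    print(f"{name}: {borrowings_per_thousand(split):.1f} borrowings per 1,000 tokens")

# Sanity check against the "Total" row.
total_tokens = sum(s["tokens"] for s in splits.values())
total_eng = sum(s["eng"] for s in splits.values())
total_other = sum(s["other"] for s in splits.values())
print(total_tokens, total_eng, total_other)
```

Computed this way, the test split comes out close to 22 borrowings per 1,000 tokens, in line with the rounded figure of 20 quoted above and several times denser than the training and development splits.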

💻 Usage example

Basic usage

import os
import pathlib

from flair.data import Sentence
from flair.models import SequenceTagger

if os.name == 'nt':  # Minor patch needed if you are running from Windows
    temp = pathlib.PosixPath  # keep a reference to the original class
    pathlib.PosixPath = pathlib.WindowsPath

# load the pretrained tagger
tagger = SequenceTagger.load("lirondos/anglicisms-spanish-flair-cs")

text = "Las fake news sobre la celebrity se reprodujeron por los mass media en prime time."

sentence = Sentence(text)

# predict tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted borrowing spans
print('The following borrowings were found:')
for entity in sentence.get_spans():
    print(entity)
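Building on the basic usage, the sketch below tags several texts in one call and groups the detected borrowings by label. Both helper names (`tag_texts`, `group_by_label`) are hypothetical. It assumes that `tagger.predict` accepts a list of sentences and that each span exposes `.text` and `.tag`; these attribute names may differ across Flair versions, so treat this as an outline rather than a reference implementation.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [texts]}, preserving order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


def tag_texts(texts, model_name="lirondos/anglicisms-spanish-flair-cs"):
    """Tag several texts at once; returns one list of (text, label) pairs per input."""
    # Imported lazily so group_by_label can be used without Flair installed.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(model_name)
    sentences = [Sentence(text) for text in texts]
    tagger.predict(sentences)  # assumption: a list of sentences is accepted
    return [[(span.text, span.tag) for span in s.get_spans()] for s in sentences]


# Example call (downloads the model on first use, so it is left commented out):
# results = tag_texts(["Las fake news sobre la celebrity.", "Vimos un anime en streaming."])
# for pairs in results:
#     print(group_by_label(pairs))
```

Grouping by label keeps ENG and OTHER borrowings apart, which matters when interpreting results given how differently the two labels score on the test set.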

📄 License

This project is licensed under CC BY 4.0.

📚 Detailed documentation

Citation

If you use this model, please cite the following reference:

@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}