anglicisms-spanish-flair-cs: open-source pretrained model for detecting unassimilated English borrowings in Spanish news

Anglicisms Spanish Flair Cs

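Developed by lirondos

A pretrained model for detecting unassimilated English lexical borrowings in Spanish news text; it identifies foreign words such as 'fake news' and 'machine learning'.

Sequence labeling

PyTorch

Spanish #Spanish-borrowing-detection #code-switching-identification #news-text-analysis

Downloads: 8,115

Published: 3/29/2022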

Model overview

This is a BiLSTM-CRF model built to detect foreign words (mostly from English) used in Spanish, such as *fake news* and *machine learning*.

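Model highlights

Multilingual borrowing detection

Identifies unassimilated English lexical borrowings in Spanish (ENG label) as well as borrowings from other languages (OTHER label).

Pretrained on code-switched data

The model's input includes Transformer-based embeddings pretrained on code-switched data, which helps it handle mixed-language text.

Challenging test set

The test set is designed to be difficult: it covers sources and dates not seen in the training set and contains many out-of-vocabulary (OOV) words (92% of its borrowings are OOV).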

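Model capabilities

Identifies English borrowings in Spanish

Identifies borrowings from other languages in Spanish

Handles multi-word borrowings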

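Use cases

News media analysis

Detecting English borrowings in news

Analyze the English words used in Spanish news, such as 'fake news' and 'prime time'.

Precision 90.16%, recall 84.34%, F1 87.16% (ENG label, test set)

Linguistic research

Lexical borrowing research

Study the distribution and trends of unassimilated lexical borrowings in Spanish.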

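🚀 Pretrained model for detecting English borrowings in Spanish

This is a pretrained model for detecting unassimilated English lexical borrowings (also known as anglicisms) in Spanish newswire. The model labels foreign words used in Spanish (mostly from English), such as fake news, machine learning, smartwatch, influencer, or streaming.

🚀 Quick start

The model is a BiLSTM-CRF that combines Transformer-based embeddings pretrained on code-switched data with subword embeddings (BPE and character embeddings). It was trained on the COALAS corpus to detect lexical borrowings.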

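Model labels

The model uses two labels:

ENG: English lexical borrowings (e.g. smartphone, online, podcast)
OTHER: lexical borrowings from other languages (e.g. boutique, anime, umami)

The model uses BIO encoding to handle multi-word borrowings.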
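
To illustrate how BIO encoding groups tokens into multi-word borrowings, here is a minimal sketch in plain Python. The function name `bio_to_spans` and the tag strings (`B-ENG`, `I-ENG`, `O`) are assumptions made for illustration; they follow the usual BIO convention and are not taken from the model's own code.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (text, label) spans (hypothetical helper)."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any span still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # An I- tag with the same label extends the open span.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not continue the open span)
            # closes the open span; the token itself is not kept.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Las", "fake", "news", "sobre", "la", "celebrity"]
tags = ["O", "B-ENG", "I-ENG", "O", "O", "B-ENG"]
print(bio_to_spans(tokens, tags))
# [('fake news', 'ENG'), ('celebrity', 'ENG')]
```

In practice Flair performs this grouping itself and returns spans directly, so a helper like this is only needed when working with raw per-token tags.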

⚠ There is another model for the same task, based on mBERT and trained with the Transformers library. That model scores lower (F1 = 83.55) than this Flair-based model.

✨ Key features

Evaluation metrics (test set)

Results obtained on the test set of the COALAS corpus:

Label	Precision	Recall	F1
ALL	90.14	81.79	85.76
ENG	90.16	84.34	87.16
OTHER	85.71	13.04	22.64
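
As a quick sanity check on the table, the F1 column can be recomputed from precision and recall. This sketch assumes the standard definition of F1 as the harmonic mean of the two; the helper name `f1_score` is illustrative. Because the published precision and recall are already rounded, a recomputed value may differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "ALL": (90.14, 81.79, 85.76),
    "ENG": (90.16, 84.34, 87.16),
    "OTHER": (85.71, 13.04, 22.64),
}

for label, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{label}: recomputed {recomputed:.2f}, reported {f1:.2f}")
```

The OTHER row shows why F1 is reported alongside precision: a recall of 13.04 pulls F1 down to 22.64 even though precision is 85.71.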

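Dataset

The model was trained on the COALAS corpus, a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a range of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, contains many out-of-vocabulary (OOV) words (92% of the borrowings in the test set are OOV), and is dense in borrowings (20 borrowings per 1,000 tokens).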

Split	Tokens	ENG borrowings	OTHER borrowings	Unique borrowings
Training	231,126	1,493	28	380
Development	82,578	306	49	316
Test	58,997	1,239	46	987
Total	372,701	3,038	123	1,683
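
The borrowing density quoted for the test set can be checked against the counts in the table. The sketch below assumes that density counts ENG and OTHER borrowings together; that is a guess about how the figure was computed, and the function name `borrowings_per_thousand` is illustrative.

```python
# (tokens, ENG borrowings, OTHER borrowings) per split, copied from the table.
splits = {
    "training": (231_126, 1_493, 28),
    "development": (82_578, 306, 49),
    "test": (58_997, 1_239, 46),
}


def borrowings_per_thousand(tokens: int, eng: int, other: int) -> float:
    """Borrowings (ENG + OTHER) per 1,000 tokens."""
    return 1000 * (eng + other) / tokens


for name, (tokens, eng, other) in splits.items():
    density = borrowings_per_thousand(tokens, eng, other)
    print(f"{name}: {density:.1f} borrowings per 1,000 tokens")
```

Under this assumption the test split comes out at about 21.8 borrowings per 1,000 tokens, in line with the rounded figure of 20, and roughly three times as dense as the training split.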

💻 Usage examples

Basic usage

from flair.data import Sentence
from flair.models import SequenceTagger
import pathlib
import os

if os.name == 'nt': # Minor patch needed if you are running from Windows
    temp = pathlib.PosixPath
    pathlib.PosixPath = pathlib.WindowsPath
  
tagger = SequenceTagger.load("lirondos/anglicisms-spanish-flair-cs")

text = "Las fake news sobre la celebrity se reprodujeron por los mass media en prime time."

sentence = Sentence(text)

# predict tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted borrowing spans
print('The following borrowings were found:')
for entity in sentence.get_spans():
    print(entity)

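📄 License

This project is released under the CC BY 4.0 license.

📚 Detailed documentation

Citation

If you use this model, please cite the following reference: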

@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}