anglicisms-spanish-mbert: open-source pretrained model for detecting English borrowings in Spanish news

Anglicisms Spanish Mbert

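Developed by lirondos

This is a pretrained model for detecting unassimilated English lexical borrowings (also known as anglicisms) in Spanish news.

Sequence labeling

Transformers

Spanish #Spanish borrowing detection #Multilingual BERT fine-tuning #News text analysis

Downloads: 7,991

Released: 3/28/2022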

Model overview

The model labels foreign words used in Spanish (mainly from English), such as *fake news*, *machine learning*, *smartwatch*, *influencer*, or *streaming*.

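Model features

Multilingual base model

Built on the multilingual BERT (mBERT) architecture and fine-tuned for lexical borrowing detection in Spanish text.

High-precision detection

Reaches an F1 score of 85.19 on English borrowings on the test set.

Specialized training corpus

Trained on the COALAS corpus, which contains about 370,000 tokens and covers a variety of written media in European Spanish.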

Model capabilities

Anglicism detection

Foreign-word identification

Code-switching analysis

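Use cases

News analysis

News text analysis

Analyze how English borrowings are used in Spanish news

Identifies unassimilated words such as *fake news* and *machine learning*

Linguistic research

Lexical borrowing research

Study the frequency and patterns of English borrowings in Spanish

Provides quantitative data to support language-contact research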

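🚀 mBERT model for detecting anglicisms in Spanish

This is a pretrained model for detecting unassimilated English lexical borrowings (anglicisms) in Spanish newswire. It labels foreign words used in Spanish (mainly from English), such as fake news, machine learning, smartwatch, influencer, or streaming.

The model is a fine-tuned version of multilingual BERT, trained on the COALAS corpus for the task of lexical borrowing detection.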

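The model uses two labels:

ENG: for English lexical borrowings (e.g. smartphone, online, podcast)
OTHER: for lexical borrowings from other languages (e.g. boutique, anime, umami)

The model uses BIO encoding to handle multi-word borrowings.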
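To make the BIO scheme concrete, here is a minimal sketch of how word-level BIO tags can be collapsed into multi-word borrowings. The tag names `B-ENG`, `I-ENG`, `B-OTHER`, and `I-OTHER` are an assumption derived from the two labels above, and `decode_bio` is a hypothetical helper, not part of the model or of any library. The tags in the example are written by hand for illustration and are not model output.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from parallel lists of tokens and BIO tags."""
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label extends the span that is currently open.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        # B- opens a new span; a stray I- is treated leniently as a span start.
        if prefix in ("B", "I"):
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hand-written tags (assumed tag names), not real predictions.
tokens = ["Buscamos", "data", "scientist", "para", "proyecto",
          "de", "machine", "learning", "."]
tags = ["O", "B-ENG", "I-ENG", "O", "O", "O", "B-ENG", "I-ENG", "O"]
spans = decode_bio(tokens, tags)
```

With these tags, the `B-` prefix marks the first word of a borrowing and `I-` marks each following word, so "data scientist" and "machine learning" each come out as a single span.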

⚠️ Important note

This is not the best-performing model for this task. For the best-performing model (F1 = 85.76), see the Flair model.

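✨ Main features

Detects unassimilated English lexical borrowings in Spanish newswire.
Labels borrowings according to their source language (ENG or OTHER).
Uses BIO encoding to handle multi-word borrowings.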

💻 Usage example

Basic usage

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned tokenizer and token-classification model
tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-mbert")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-mbert")

# Wrap both in a token-classification ("ner") pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Buscamos data scientist para proyecto de machine learning."

# Each prediction is a dict describing one token tagged as part of a borrowing
borrowings = nlp(example)
print(borrowings)
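With default settings the pipeline returns one dict per tagged token, and those tokens may be subword pieces. The sketch below shows one possible way to merge such predictions into whole borrowings using the `start` and `end` character offsets. `merge_token_predictions` is a hypothetical helper, the `B-ENG`/`I-ENG` tag names are assumed from the label description, and the `mock_predictions` list is written by hand to imitate the pipeline's output format. It is not real model output.

```python
from typing import Dict, List


def merge_token_predictions(text: str, predictions: List[Dict]) -> List[Dict]:
    """Merge token-level predictions into character-level borrowing spans."""
    spans: List[Dict] = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        prefix, _, label = pred["entity"].partition("-")
        last = spans[-1] if spans else None
        # A piece starting exactly where the last one ended belongs to the same word.
        same_word = last is not None and pred["start"] == last["end"]
        # An I- tag with the same label, separated only by whitespace, continues the span.
        continues = (
            last is not None
            and prefix == "I"
            and label == last["label"]
            and text[last["end"]:pred["start"]].strip() == ""
        )
        if same_word or continues:
            last["end"] = pred["end"]
        else:
            spans.append({"label": label, "start": pred["start"], "end": pred["end"]})
    # Recover the surface text of each span from the original string.
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


example = "Buscamos data scientist para proyecto de machine learning."

# Hand-written mock in the pipeline's output format (assumed tags and subword split).
mock_predictions = [
    {"entity": "B-ENG", "word": "data", "start": 9, "end": 13},
    {"entity": "I-ENG", "word": "scient", "start": 14, "end": 20},
    {"entity": "I-ENG", "word": "##ist", "start": 20, "end": 23},
    {"entity": "B-ENG", "word": "machine", "start": 41, "end": 48},
    {"entity": "I-ENG", "word": "learning", "start": 49, "end": 57},
]
merged = merge_token_predictions(example, mock_predictions)
```

In real use, `mock_predictions` would be replaced by the `borrowings` list from the snippet above. The token-classification pipeline in transformers also accepts an `aggregation_strategy` argument (for example `"simple"`), which may make a helper like this unnecessary.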

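📚 Detailed documentation

Evaluation metrics (test set)

The table below summarizes the results obtained on the test set of the COALAS corpus.

Label	Precision	Recall	F1
ALL	88.09	79.46	83.55
ENG	88.44	82.16	85.19
OTHER	37.5	6.52	11.11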
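As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values. `f1_score` is a small illustrative helper, not the evaluation script used by the authors, and because the inputs are already rounded the recomputed values may differ from the reported ones in the last decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "ALL": (88.09, 79.46, 83.55),
    "ENG": (88.44, 82.16, 85.19),
    "OTHER": (37.5, 6.52, 11.11),
}

# Recompute F1 for each label and compare against the reported value.
recomputed = {label: f1_score(p, r) for label, (p, r, _) in reported.items()}
for label, (_, _, f1) in reported.items():
    print(f"{label}: reported {f1:.2f}, recomputed {recomputed[label]:.2f}")
```

The low OTHER score follows directly from its recall of 6.52: the harmonic mean is pulled toward the smaller of the two values.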

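Dataset

The model was trained on the COALAS corpus, a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a variety of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, contains a high proportion of out-of-vocabulary (OOV) words (92% of the borrowings in the test set are OOV), and is very borrowing-dense (20 borrowings per 1,000 tokens).

Split	Tokens	ENG borrowings	OTHER borrowings	Unique borrowings
Training	231,126	1,493	28	380
Development	82,578	306	49	316
Test	58,997	1,239	46	987
Total	372,701	3,038	123	1,683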
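The borrowing density mentioned above can be roughly checked against the table. The sketch below divides the number of annotated borrowings (ENG plus OTHER) by the token count of each split. This is a back-of-the-envelope calculation that assumes each table entry counts one borrowing regardless of its length in words; the function name is illustrative only.

```python
from typing import Dict, Tuple

# (tokens, ENG borrowings, OTHER borrowings) copied from the table above.
splits: Dict[str, Tuple[int, int, int]] = {
    "training": (231_126, 1_493, 28),
    "development": (82_578, 306, 49),
    "test": (58_997, 1_239, 46),
}


def borrowings_per_thousand(tokens: int, eng: int, other: int) -> float:
    """Number of annotated borrowings per 1,000 tokens."""
    return 1000 * (eng + other) / tokens


density = {name: borrowings_per_thousand(*counts) for name, counts in splits.items()}
for name, value in density.items():
    print(f"{name}: {value:.1f} borrowings per 1,000 tokens")

# The per-split counts should add up to the "Total" row of the table.
total_tokens = sum(counts[0] for counts in splits.values())
total_eng = sum(counts[1] for counts in splits.values())
total_other = sum(counts[2] for counts in splits.values())
```

By this count the test split comes out close to the stated figure of about 20 borrowings per 1,000 tokens, and is several times denser than the training and development splits.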

📄 License

This model is released under the CC BY 4.0 license.

📚 Citation

If you use this model, please cite the following paper:

@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}