Anglicisms Spanish mBERT: open-source pretrained model for detecting English borrowings in Spanish news

Anglicisms Spanish mBERT

Developed by lirondos

This is a pretrained model for detecting unassimilated English lexical borrowings (also known as anglicisms) in Spanish newswire.

Sequence labeling

Transformers

Spanish #Spanish borrowing detection #Multilingual BERT fine-tuning #News text analysis

Downloads: 7,991

Release date: 3/28/2022

Model overview

The model labels foreign words (mostly of English origin) used in Spanish, such as *fake news*, *machine learning*, *smartwatch*, *influencer*, and *streaming*.

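Model features

Multilingual base model

Built on the multilingual BERT architecture. The model was fine-tuned and evaluated on Spanish newswire; performance on borrowings in other languages is not reported here.

High-accuracy detection

On the test set, F1 for English borrowings (ENG) reaches 85.19.

Trained on a specialized corpus

Trained on the COALAS corpus, which contains 370,000 tokens and covers a variety of written media in European Spanish.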

Model capabilities

English borrowing detection

Loanword identification

Code-switching analysis

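Use cases

News analysis

News text analysis

Analyze how English borrowings are used in Spanish news.

Identify unassimilated vocabulary such as *fake news* and *machine learning*.

Linguistic research

Lexical borrowing research

Study the frequency and patterns of English borrowings in Spanish.

Provide quantitative data to support language contact research.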

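🚀 Anglicisms - Spanish - mBERT

This is a pretrained model for detecting unassimilated English lexical borrowings (also known as anglicisms) in Spanish newswire. It labels words of foreign origin (mostly English) used in Spanish, such as fake news, machine learning, smartwatch, influencer, and streaming.

The model is multilingual BERT fine-tuned on the COALAS corpus for the lexical borrowing detection task.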

The model considers two labels:

ENG: English lexical borrowings (smartphone, online, podcast, etc.)
OTHER: lexical borrowings from other languages (boutique, anime, umami, etc.)

The model uses BIO encoding to account for multi-token borrowings.
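To illustrate how BIO encoding represents multi-token borrowings, here is a minimal sketch that tags a whitespace-tokenized sentence. The tag spellings `B-ENG`, `I-ENG`, and `O` follow the usual BIO convention applied to the ENG label and are assumptions, as is the helper `bio_encode`; the model's own tag inventory may be spelled differently.

```python
def bio_encode(tokens, spans):
    """Return one BIO tag per token.

    `spans` is a list of (start, end, label) triples indexing into `tokens`,
    with `end` exclusive. Hypothetical helper, for illustration only.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the same borrowing
    return tags


tokens = "Buscamos data scientist para proyecto de machine learning .".split()
# Two two-token borrowings: "data scientist" and "machine learning".
spans = [(1, 3, "ENG"), (6, 8, "ENG")]
tags = bio_encode(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The `B-` prefix marks where a borrowing starts, so two adjacent borrowings stay distinguishable even when they share a label.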

⚠️ Important note

This is not the best-performing model for this task. For the best-performing model (F1=85.76), see the Flair model.

📚 Documentation

🔢 Evaluation metrics (test set)

The following table summarizes the results obtained on the test set of the COALAS corpus.

Label	Precision	Recall	F1
ALL	88.09	79.46	83.55
ENG	88.44	82.16	85.19
OTHER	37.5	6.52	11.11
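As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values. Because those inputs are already rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit. The function name `f1_score` is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported for the COALAS test set.
reported = {
    "ALL": (88.09, 79.46),
    "ENG": (88.44, 82.16),
    "OTHER": (37.5, 6.52),
}

for label, (precision, recall) in reported.items():
    print(label, round(f1_score(precision, recall), 2))
```

The low OTHER score comes almost entirely from recall (6.52): the harmonic mean is pulled toward the smaller of the two values.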

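📊 Dataset

The model was trained on COALAS, a corpus of Spanish newswire annotated with unassimilated lexical borrowings. The corpus contains 370,000 tokens and covers a variety of written media in European Spanish. The test set was designed to be as difficult as possible: it covers sources and dates not seen in the training set, contains a high number of out-of-vocabulary (OOV) words (92% of the borrowings in the test set are OOV), and is dense in borrowings (20 borrowings per 1,000 tokens).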

Set	Tokens	ENG	OTHER	Unique
Training	231,126	1,493	28	380
Development	82,578	306	49	316
Test	58,997	1,239	46	987
Total	372,701	3,038	123	1,683
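The borrowing density mentioned above can be recomputed from the table. This is a rough sketch that assumes the ENG and OTHER columns count annotated borrowings (not individual tokens); the dictionary layout and the helper `borrowings_per_thousand` are illustrative choices.

```python
# Per-split counts copied from the dataset table.
splits = {
    "training": {"tokens": 231_126, "ENG": 1_493, "OTHER": 28},
    "development": {"tokens": 82_578, "ENG": 306, "OTHER": 49},
    "test": {"tokens": 58_997, "ENG": 1_239, "OTHER": 46},
}


def borrowings_per_thousand(split):
    """Borrowings (ENG + OTHER) per 1,000 tokens for one split."""
    return 1000 * (split["ENG"] + split["OTHER"]) / split["tokens"]


density = {name: borrowings_per_thousand(s) for name, s in splits.items()}
total_tokens = sum(s["tokens"] for s in splits.values())

for name, value in density.items():
    print(f"{name}: {value:.2f} borrowings per 1,000 tokens")
print(f"total tokens: {total_tokens:,}")
```

Under that assumption, the test split comes out several times denser in borrowings than the training and development splits, which is consistent with the test set being designed to be hard.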

ℹ️ More information

For more information about the dataset, the model experiments, and the error analysis, see the paper Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling.

💻 Usage example

Basic usage

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("lirondos/anglicisms-spanish-mbert")
model = AutoModelForTokenClassification.from_pretrained("lirondos/anglicisms-spanish-mbert")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Buscamos data scientist para proyecto de machine learning."

# Run the tagger; the result is a list with one prediction per tagged token
borrowings = nlp(example)
print(borrowings)
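The pipeline returns token-level predictions, so a multi-token borrowing arrives as several entries. The sketch below merges them into character spans. It assumes each prediction is a dict with `entity`, `start`, and `end` keys and that tags look like `B-ENG` / `I-ENG`. The helper `merge_borrowings` is hypothetical, and `mock_predictions` is hand-written for illustration, not real model output.

```python
def merge_borrowings(text, token_predictions):
    """Merge token-level BIO predictions into labeled character spans."""
    spans = []
    for tok in token_predictions:
        prefix, _, label = tok["entity"].partition("-")
        if not label:  # skip untagged tokens such as "O"
            continue
        continues_previous = (
            bool(spans)
            and spans[-1]["label"] == label
            # An I- tag, or a subword glued to the previous piece, extends the span.
            and (prefix == "I" or tok["start"] == spans[-1]["end"])
        )
        if continues_previous:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


example = "Buscamos data scientist para proyecto de machine learning."
# Hand-written stand-in for pipeline output (assumed shape, not real predictions).
mock_predictions = [
    {"entity": "B-ENG", "start": 9, "end": 13},
    {"entity": "I-ENG", "start": 14, "end": 20},
    {"entity": "I-ENG", "start": 20, "end": 23},
    {"entity": "B-ENG", "start": 41, "end": 48},
    {"entity": "I-ENG", "start": 49, "end": 57},
]

merged = merge_borrowings(example, mock_predictions)
for span in merged:
    print(span["label"], span["text"])
```

The `pipeline` function also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens inside the library; whether its grouping matches the spans above for this model has not been verified here.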

📄 License

This model is released under the CC BY 4.0 license.

📚 Citation

If you use this model, please cite the following reference:

@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}