# anglicisms-spanish-flair-cs
This is a pretrained model for detecting unassimilated English lexical borrowings in Spanish newswire. It labels expressions such as "fake news" or "machine learning" that are used in Spanish text without orthographic adaptation.
Note: There is another mBERT-based model for the same task, trained with the Transformers library; however, this Flair-based model outperforms it (85.76 vs. 83.55 overall F1).
## Technical Details
The model is a BiLSTM-CRF that takes as input Transformer-based embeddings pretrained on codeswitched data along with subword embeddings (BPE embeddings and character embeddings). It was trained on the COALAS corpus for the task of detecting lexical borrowings.
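As a rough sketch of how this kind of architecture is assembled in Flair (the checkpoint name and hyperparameters below are assumptions for illustration, not taken from this card):

```python
from flair.data import Dictionary
from flair.embeddings import (
    StackedEmbeddings,
    TransformerWordEmbeddings,
    BytePairEmbeddings,
    CharacterEmbeddings,
)
from flair.models import SequenceTagger

# Tag dictionary for BIO-encoded ENG/OTHER labels
tag_dictionary = Dictionary(add_unk=False)
for tag in ["O", "B-ENG", "I-ENG", "B-OTHER", "I-OTHER"]:
    tag_dictionary.add_item(tag)

# Stack Transformer embeddings (a placeholder English-Spanish codeswitch
# checkpoint; the card does not name the one actually used) with subword embeddings
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("sagorsarker/codeswitch-spaeng-lid-lince"),  # assumed checkpoint
    BytePairEmbeddings("es"),   # Spanish BPE embeddings
    CharacterEmbeddings(),      # character-level embeddings
])

# BiLSTM-CRF sequence tagger over the stacked embeddings
tagger = SequenceTagger(
    hidden_size=256,            # assumed hyperparameter
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
)
```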
The model considers two labels:

- ENG: for English lexical borrowings (e.g., smartphone, online, podcast)
- OTHER: for lexical borrowings from any other language (e.g., boutique, anime, umami)

It uses BIO encoding to account for multitoken borrowings, as illustrated below.
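For example, in the sentence used under Usage Examples below, the multitoken borrowing "fake news" gets a B-ENG tag on its first token and an I-ENG tag on the continuation (a minimal illustration showing only the surrounding tokens):

```
Las    O
fake   B-ENG
news   I-ENG
sobre  O
```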
## Metrics (on the test set)
The following results were obtained on the test set of the COALAS corpus.
| Property | Details |
|----------|---------|
| Model Type | BiLSTM-CRF |
| Training Data | COALAS corpus |
| LABEL | Precision | Recall | F1 |
|-------|-----------|--------|------|
| ALL   | 90.14 | 81.79 | 85.76 |
| ENG   | 90.16 | 84.34 | 87.16 |
| OTHER | 85.71 | 13.04 | 22.64 |
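F1 is the harmonic mean of precision and recall, which explains the low OTHER score despite its high precision: 2 · 85.71 · 13.04 / (85.71 + 13.04) ≈ 22.64. In other words, the model is precise on non-English borrowings but rarely recovers them (13.04 recall).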
## Installation
The original card gives no explicit installation steps; the usage example below only requires the Flair library.
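Assuming a standard Python environment, Flair can be installed from PyPI:

```bash
pip install flair
```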
## Usage Examples
### Basic Usage
```python
from flair.data import Sentence
from flair.models import SequenceTagger
import pathlib
import os

# Workaround for Windows: the checkpoint stores PosixPath objects,
# which cannot be instantiated on Windows
if os.name == 'nt':
    temp = pathlib.PosixPath
    pathlib.PosixPath = pathlib.WindowsPath

# Load the pretrained tagger from the Hugging Face Hub
tagger = SequenceTagger.load("lirondos/anglicisms-spanish-flair-cs")

text = "Las fake news sobre la celebrity se reprodujeron por los mass media en prime time."
sentence = Sentence(text)

# Predict borrowing tags for the sentence
tagger.predict(sentence)

print(sentence)
print('The following borrowings were found:')
for entity in sentence.get_spans():
    print(entity)
```
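To consume the predictions programmatically, each returned span exposes its surface text and its ENG/OTHER label. A minimal sketch, assuming the `sentence` tagged above:

```python
# Group detected borrowings by label (ENG vs. OTHER)
borrowings = {"ENG": [], "OTHER": []}
for span in sentence.get_spans():
    label = span.get_label().value  # "ENG" or "OTHER"
    borrowings[label].append(span.text)

print(borrowings["ENG"])  # e.g. ['fake news', ...]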
## Documentation
More information about the dataset, model experimentation, and error analysis can be found in the paper *Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling* (ACL 2022).
## License

This project is licensed under the CC BY 4.0 license.
## Citation
If you use this model, please cite the following reference:
```bibtex
@inproceedings{alvarez-mellado-lignos-2022-detecting,
    title = "Detecting Unassimilated Borrowings in {S}panish: {A}n Annotated Corpus and Approaches to Modeling",
    author = "{\'A}lvarez-Mellado, Elena  and
      Lignos, Constantine",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.268",
    pages = "3868--3888",
    abstract = "This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings{---}words from one language that are introduced into another without orthographic adaptation{---}and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.",
}
```