Open-source model xlm-roberta-large-finetuned-conll02-spanish - Achieve accurate named entity recognition in Spanish

Xlm Roberta Large Finetuned Conll02 Spanish

Developed by FacebookAI

Named entity recognition model fine-tuned on the Spanish CoNLL-2002 dataset based on XLM-RoBERTa-large

Sequence Labeling, Supports Multiple Languages, #Spanish NER, #Multilingual Pretraining, #Entity Recognition

Downloads 244

Release Time: 3/2/2022

Model Overview

This model is a fine-tuned version of XLM-RoBERTa-large, specifically designed for named entity recognition tasks in Spanish text.

Model Features

Multilingual Pretraining

Based on the XLM-RoBERTa-large model, which was pretrained on 100 languages

Spanish Optimization

Specifically fine-tuned for Spanish text

Efficient Named Entity Recognition

Excellent performance on the Spanish CoNLL-2002 dataset

Model Capabilities

Named Entity Recognition

Spanish Text Processing

Token Classification

Use Cases

Natural Language Processing

Spanish Text Entity Extraction

Identify entities such as person names, locations, and organization names from Spanish text

Performs well on the CoNLL-2002 dataset

Document Information Extraction

Process Spanish documents to extract key entity information

🚀 xlm-roberta-large-finetuned-conll02-spanish

This is an XLM-RoBERTa large model fine-tuned on the CoNLL-2002 Spanish dataset. It can be used for token classification tasks such as Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.

🚀 Quick Start

Use the code below to get started with the model. You can use this model directly within a pipeline for NER.


>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll02-spanish")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll02-spanish")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Efectuaba un vuelo entre bombay y nueva york.")

[{'end': 30,
  'entity': 'B-LOC',
  'index': 7,
  'score': 0.95703226,
  'start': 25,
  'word': '▁bomba'},
 {'end': 39,
  'entity': 'B-LOC',
  'index': 10,
  'score': 0.9771854,
  'start': 34,
  'word': '▁nueva'},
 {'end': 43,
  'entity': 'I-LOC',
  'index': 11,
  'score': 0.9914097,
  'start': 40,
  'word': '▁yor'}]
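
The output above is reported per sub-word token, so a multi-word place name arrives as separate `B-LOC` and `I-LOC` pieces. The following is a minimal sketch, not part of the original model card, of how those pieces could be merged into entity spans using the `start` and `end` character offsets. The helper name `merge_entities` and its `expand_to_word` option are hypothetical, and the token list is copied from the output above. The `transformers` pipeline also accepts an `aggregation_strategy` argument that performs grouping for you; this hand-rolled version only illustrates the idea.

```python
def merge_entities(text, tokens, expand_to_word=False):
    """Merge B-/I- tagged sub-word tokens into (type, start, end, surface) spans.

    Hypothetical helper for illustration; not part of the transformers API.
    """
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        prefix, _, etype = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1][0] == etype:
            # An I- tag of the same type continues the current span.
            spans[-1][2] = tok["end"]
        else:
            # A B- tag (or a type change) opens a new span.
            spans.append([etype, tok["start"], tok["end"]])

    merged = []
    for etype, start, end in spans:
        if expand_to_word:
            # Assumption: extend the span to the end of the surrounding word,
            # since trailing sub-word pieces may be missing from the output.
            while end < len(text) and text[end].isalnum():
                end += 1
        merged.append((etype, start, end, text[start:end]))
    return merged


text = "Efectuaba un vuelo entre bombay y nueva york."
tokens = [
    {"entity": "B-LOC", "start": 25, "end": 30, "word": "▁bomba"},
    {"entity": "B-LOC", "start": 34, "end": 39, "word": "▁nueva"},
    {"entity": "I-LOC", "start": 40, "end": 43, "word": "▁yor"},
]

print(merge_entities(text, tokens))
print(merge_entities(text, tokens, expand_to_word=True))
```

Without `expand_to_word`, the spans cover only the characters of the tokens that were returned. Extending to the word boundary is a heuristic and may not suit every text.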

✨ Features

Direct Use

The model is a language model. It can be used for token classification, a natural language understanding task in which a label is assigned to some tokens in a text.
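
As a rough illustration of what assigning a label to tokens means, the sketch below turns per-token scores into labels by applying a softmax and picking the highest-scoring class. The label list, its ordering, the tokens, and the score values are all hypothetical and chosen only for illustration; the model's actual label set lives in its configuration and is not reproduced here.

```python
import math

# Hypothetical label set and ordering, for illustration only.
LABELS = ["O", "B-LOC", "I-LOC"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_tokens(tokens, logits):
    """Pair each token with its most probable label and that label's probability."""
    results = []
    for token, row in zip(tokens, logits):
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((token, LABELS[best], probs[best]))
    return results


# Hypothetical tokens and per-token scores (one row per token, one column per label).
tokens = ["▁nueva", "▁york", "."]
logits = [
    [0.1, 4.0, 0.3],
    [0.2, 0.5, 5.0],
    [6.0, 0.1, 0.2],
]

tagged = label_tokens(tokens, logits)
for token, label, score in tagged:
    print(token, label, round(score, 3))
```

The probability attached to each label plays the same role as the `score` field in the pipeline output shown in the Quick Start.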

Downstream Use

Potential downstream use cases include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face token classification docs.

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

📚 Documentation

Model Details

Model Description

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multilingual language model, trained on 2.5TB of filtered CommonCrawl data. This model is [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) fine-tuned with the CoNLL-2002 dataset in Spanish.

Property	Details
Developed by	See associated paper
Model Type	Multilingual language model
Language(s) (NLP)	XLM-RoBERTa is a multilingual model trained on 100 different languages (see the GitHub Repo for the full list); this model is fine-tuned on a dataset in Spanish.
License	More information needed
Related Models	[RoBERTa](https://huggingface.co/roberta-base), XLM; Parent Model: [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large)
Resources for more information	GitHub Repo, Associated Paper, CoNLL-2002 data card

Bias, Risks, and Limitations

⚠️ Important Note

CONTENT WARNING: Readers should be made aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).

💡 Usage Tip

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Training

See the following resources for training data and training procedure details:

[XLM-RoBERTa-large model card](https://huggingface.co/xlm-roberta-large)
CoNLL-2002 data card
Associated paper

Evaluation

See the associated paper for evaluation details.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Property	Details
Hardware Type	500 32GB Nvidia V100 GPUs (from the associated paper)
Hours used	More information needed
Cloud Provider	More information needed
Compute Region	More information needed
Carbon Emitted	More information needed
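
Since the hours, provider, and region are not reported, no emissions figure can be given here. The sketch below only shows the general shape of such an estimate: energy drawn by the hardware multiplied by the carbon intensity of the grid. It is a simplified back-of-the-envelope formula in the spirit of impact calculators, not the calculator's actual implementation. The function name, the per-GPU power draw, the one-hour duration, the carbon intensity, and the PUE (data-center overhead) factor are all hypothetical placeholders; only the GPU count comes from the table above.

```python
def estimate_emissions_kg(n_gpus, gpu_power_watts, hours,
                          carbon_intensity_kg_per_kwh, pue=1.0):
    """Rough CO2-equivalent estimate in kilograms.

    energy (kWh)   = n_gpus * power per GPU (kW) * hours * PUE
    emissions (kg) = energy (kWh) * grid carbon intensity (kg CO2eq / kWh)
    """
    energy_kwh = n_gpus * (gpu_power_watts / 1000.0) * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


# Only n_gpus=500 is taken from the table; every other value is a placeholder.
example_kg = estimate_emissions_kg(
    n_gpus=500,
    gpu_power_watts=300,               # hypothetical per-GPU draw
    hours=1.0,                         # hypothetical: actual hours are not reported
    carbon_intensity_kg_per_kwh=0.4,   # hypothetical grid intensity
)
print(f"Estimated emissions per hour of training: {example_kg:.1f} kg CO2eq")
```

Replacing the placeholders with measured hours and the real region's carbon intensity would be needed before quoting any figure.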

Technical Specifications

See the associated paper for further details.

Citation

BibTeX:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

APA:

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Model Card Authors

This model card was written by the team at Hugging Face.
