🚀 xlm-roberta-large-finetuned-conll02-spanish
This is an XLM-RoBERTa large model fine-tuned on the CoNLL-2002 Spanish dataset. It can be used for token classification tasks such as Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging.
🚀 Quick Start
Use the code below to get started with the model. You can use this model directly within a pipeline for NER.
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll02-spanish")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll02-spanish")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Efectuaba un vuelo entre bombay y nueva york.")
[{'end': 30,
'entity': 'B-LOC',
'index': 7,
'score': 0.95703226,
'start': 25,
'word': '▁bomba'},
{'end': 39,
'entity': 'B-LOC',
'index': 10,
'score': 0.9771854,
'start': 34,
'word': '▁nueva'},
{'end': 43,
'entity': 'I-LOC',
'index': 11,
'score': 0.9914097,
'start': 40,
'word': '▁yor'}]
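The raw output above labels individual sub-word pieces, so the entities appear truncated (`▁bomba`, `▁yor`). The following is a minimal sketch of one way to post-process such output into whole-word entities using only the `start`/`end` offsets. The `merge_entities` helper and the `TEXT`/`TOKENS` names are hypothetical and not part of `transformers`; the token list is copied from the example output rather than produced by running the model.

```python
import re

# Sentence and pipeline output copied from the Quick Start example.
TEXT = "Efectuaba un vuelo entre bombay y nueva york."
TOKENS = [
    {"entity": "B-LOC", "score": 0.95703226, "index": 7,
     "word": "▁bomba", "start": 25, "end": 30},
    {"entity": "B-LOC", "score": 0.9771854, "index": 10,
     "word": "▁nueva", "start": 34, "end": 39},
    {"entity": "I-LOC", "score": 0.9914097, "index": 11,
     "word": "▁yor", "start": 40, "end": 43},
]

# Matches zero or more word characters; used to finish a truncated word.
WORD_TAIL = re.compile(r"\w*")


def merge_entities(tokens, text):
    """Hypothetical helper: merge B-/I- tagged pieces into whole-word spans."""
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1]["label"] == label:
            # An I- tag with the same label continues the previous entity.
            spans[-1]["end"] = tok["end"]
            spans[-1]["scores"].append(tok["score"])
        else:
            # A B- tag (or a stray I- tag) starts a new entity.
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "scores": [tok["score"]]})

    entities = []
    for span in spans:
        # Extend the end offset over untagged trailing word characters,
        # e.g. the final "y" of "bombay" that carries no label.
        end = WORD_TAIL.match(text, span["end"]).end()
        entities.append({
            "label": span["label"],
            "text": text[span["start"]:end],
            "start": span["start"],
            "end": end,
            "score": sum(span["scores"]) / len(span["scores"]),
        })
    return entities


entities = merge_entities(TOKENS, TEXT)
for ent in entities:
    print(ent["label"], repr(ent["text"]), ent["start"], ent["end"])
# LOC 'bombay' 25 31
# LOC 'nueva york' 34 44
```

Averaging the piece scores is an arbitrary choice made for this sketch. The token-classification pipeline in recent `transformers` releases also accepts an `aggregation_strategy` argument that groups pieces inside the library; check the documentation for your installed version before relying on it.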
✨ Features
Direct Use
The model is a language model. It can be used for token classification, a natural language understanding task in which a label is assigned to some tokens in a text.
Downstream Use
Potential downstream use cases include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face token classification docs.
Out-of-Scope Use
The model should not be used to intentionally create hostile or alienating environments for people.
📚 Documentation
Model Details
Model Description
The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multilingual language model, trained on 2.5TB of filtered CommonCrawl data. This model is [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) fine-tuned with the CoNLL-2002 dataset in Spanish.
| Property | Details |
|---|---|
| Developed by | See associated paper |
| Model Type | Multilingual language model |
| Language(s) (NLP) | XLM-RoBERTa is a multilingual model trained on 100 different languages (see the GitHub Repo for the full list); this model is fine-tuned on a dataset in Spanish. |
| License | More information needed |
| Related Models | [RoBERTa](https://huggingface.co/roberta-base), XLM. Parent Model: [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) |
| Resources for more information | GitHub Repo, Associated Paper, CoNLL-2002 data card |
Bias, Risks, and Limitations
⚠️ Important Note
CONTENT WARNING: Readers should be made aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)).
💡 Usage Tip
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Training
See the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model card and the CoNLL-2002 data card for training data and training procedure details.
Evaluation
See the associated paper for evaluation details.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
| Property | Details |
|---|---|
| Hardware Type | 500 32GB Nvidia V100 GPUs (from the associated paper) |
| Hours used | More information needed |
| Cloud Provider | More information needed |
| Compute Region | More information needed |
| Carbon Emitted | More information needed |
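Because the hours, provider and region are not reported, no emissions figure can be given for this model. As a rough illustration of the kind of arithmetic an emissions estimate involves (energy drawn multiplied by the carbon intensity of the grid), here is a minimal sketch. It is not the calculator's exact method. Only the GPU count comes from the table; the per-GPU power draw, the one-hour duration and the carbon intensity are placeholder assumptions, and `estimate_co2_kg` is a hypothetical helper.

```python
def estimate_co2_kg(num_gpus, gpu_power_watts, hours, kg_co2_per_kwh):
    """Hypothetical helper: energy used (kWh) times grid carbon intensity."""
    energy_kwh = num_gpus * gpu_power_watts * hours / 1000.0
    return energy_kwh * kg_co2_per_kwh


NUM_GPUS = 500            # from the table above
GPU_POWER_WATTS = 300.0   # ASSUMPTION: nominal per-GPU draw, not a measurement
HOURS = 1.0               # ASSUMPTION: placeholder, training time is not reported
KG_CO2_PER_KWH = 0.4      # ASSUMPTION: illustrative grid carbon intensity

per_hour = estimate_co2_kg(NUM_GPUS, GPU_POWER_WATTS, HOURS, KG_CO2_PER_KWH)
print(f"{per_hour:.1f} kg CO2eq per hour under these assumptions")
```

Replacing the placeholders with measured power, actual training hours and the carbon intensity of the real compute region would be needed before quoting any figure.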
Technical Specifications
See the associated paper for further details.
Citation
BibTeX:
@article{conneau2019unsupervised,
title={Unsupervised Cross-lingual Representation Learning at Scale},
author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
journal={arXiv preprint arXiv:1911.02116},
year={2019}
}
APA:
- Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Model Card Authors
This model card was written by the team at Hugging Face.