🚀 roberta-large-ner-english: model fine-tuned from roberta-large for NER task
This is an English NER model fine-tuned from roberta-large on the conll2003 dataset. It performs particularly well on informal text such as emails and chat, including entities that are not capitalised.
🚀 Quick Start
The roberta-large-ner-english model is an English NER model fine-tuned from roberta-large on the conll2003 dataset. It was validated on email/chat data and outperformed other models on that domain, in particular on entities that do not start with an uppercase letter.
✨ Features
- Fine-tuned from roberta-large on the conll2003 dataset.
- Performs well on email/chat data.
- Handles entities that do not start with an uppercase letter better than comparable models (see the sketch after this list).
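The behaviour on lower-cased entities can be checked directly with a token-classification pipeline. A minimal sketch; the chat-style sentence is only an illustrative input, not taken from the original card:

```python
from transformers import pipeline

# Load the fine-tuned model through a token-classification pipeline.
nlp = pipeline(
    "ner",
    model="Jean-Baptiste/roberta-large-ner-english",
    aggregation_strategy="simple",
)

# Chat-style text in which the entities are not capitalised (illustrative example).
print(nlp("hey, can you ask john smith from microsoft to join the call from paris?"))
```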
📦 Installation
The model is loaded through the Hugging Face transformers library; no model-specific installation steps are required beyond installing transformers (see the Usage Examples below).
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")

# Build a token-classification pipeline that merges sub-word tokens into entity spans
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer")
```
Expected output:

```python
[{'entity_group': 'ORG',
  'score': 0.99381506,
  'word': ' Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.99970853,
  'word': ' Steve Jobs',
  'start': 29,
  'end': 39},
 {'entity_group': 'PER',
  'score': 0.99981767,
  'word': ' Steve Wozniak',
  'start': 41,
  'end': 54},
 {'entity_group': 'PER',
  'score': 0.99956465,
  'word': ' Ronald Wayne',
  'start': 59,
  'end': 71},
 {'entity_group': 'PER',
  'score': 0.9997918,
  'word': ' Wozniak',
  'start': 92,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.99956393,
  'word': ' Apple I',
  'start': 102,
  'end': 109}]
```
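The pipeline returns a list of dictionaries with entity_group, score, word, start, and end keys. A minimal sketch of turning that output into (span text, label, score) tuples, reusing the nlp pipeline built above; the helper name is chosen here for illustration and is not part of the model's API:

```python
def extract_entities(text, nlp):
    """Return (span text, entity label, score) tuples from the pipeline output."""
    return [
        (text[ent["start"]:ent["end"]], ent["entity_group"], float(ent["score"]))
        for ent in nlp(text)
    ]

text = "Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne"
print(extract_entities(text, nlp))
# e.g. [('Apple', 'ORG', 0.99...), ('Steve Jobs', 'PER', 0.99...), ...]
```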
📚 Documentation
Training data
| Property | Details |
|----------|---------|
| Model Type | English NER model fine-tuned from roberta-large |
| Training Data | conll2003 dataset |

Training data was classified as follows (the B- and I- prefixes from the original conll2003 tags were removed):

| Label | Description |
|-------|-------------|
| O | Outside of a named entity |
| MISC | Miscellaneous entity |
| PER | Person's name |
| ORG | Organization |
| LOC | Location |

The train and test splits of the original conll2003 were used for training, and the "validation" split was used for validation. The resulting dataset sizes:

| Split | Size |
|-------|------|
| Train | 17494 |
| Validation | 3250 |
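The label set described above can also be confirmed programmatically from the model configuration; a minimal check, not part of the original card:

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")

# Mapping from class index to label name, e.g. O / MISC / PER / ORG / LOC.
print(model.config.id2label)
```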
Model performances
Model performances were computed on the conll2003 validation dataset (computed on token predictions):

| Entity | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| PER | 0.9914 | 0.9927 | 0.9920 |
| ORG | 0.9627 | 0.9661 | 0.9644 |
| LOC | 0.9795 | 0.9862 | 0.9828 |
| MISC | 0.9292 | 0.9262 | 0.9277 |
| Overall | 0.9740 | 0.9766 | 0.9753 |
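One way to compute similar token-level precision/recall/F1 scores, assuming you already have aligned lists of gold and predicted labels; this is a sketch under that assumption, not the exact evaluation script used for the table above:

```python
from sklearn.metrics import classification_report

# Gold and predicted labels, one per token, already aligned (toy example).
y_true = ["PER", "PER", "O", "ORG", "O", "LOC"]
y_pred = ["PER", "PER", "O", "ORG", "O", "O"]

# Per-entity precision, recall and F1, scoring only the entity classes (not "O").
print(classification_report(y_true, y_pred, labels=["PER", "ORG", "LOC", "MISC"], zero_division=0))
```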
On a private dataset (email, chat, informal discussion), computed on word predictions:
| Entity | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| PER | 0.8823 | 0.9116 | 0.8967 |
| ORG | 0.7694 | 0.7292 | 0.7487 |
| LOC | 0.8619 | 0.7768 | 0.8171 |
By comparison, on the same private dataset, spaCy (en_core_web_trf-3.2.0) gave the following results:
| Entity | Precision | Recall | F1 |
|--------|-----------|--------|-----|
| PER | 0.9146 | 0.8287 | 0.8695 |
| ORG | 0.7655 | 0.6437 | 0.6993 |
| LOC | 0.8727 | 0.6180 | 0.7236 |
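The spaCy numbers above come from the en_core_web_trf pipeline. A minimal sketch of producing comparable entity predictions with spaCy, assuming the en_core_web_trf model is installed; this is not the author's benchmarking code:

```python
import spacy

# Requires: python -m spacy download en_core_web_trf
nlp_spacy = spacy.load("en_core_web_trf")

doc = nlp_spacy("Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne")
# spaCy uses PERSON/ORG/GPE-style labels, which need mapping to PER/ORG/LOC for a fair comparison.
print([(ent.text, ent.label_) for ent in doc.ents])
```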
🔧 Technical Details
No additional technical details are provided in the original document.
📄 License
No license information is provided in the original document.