🚀 camembert-ner: A Fine-tuned Model from camemBERT for NER Task
camembert-ner is a Named Entity Recognition (NER) model for French text, fine-tuned from camemBERT on the wikiner-fr dataset. It was validated on email and chat data, where it outperformed other models.
🚀 Quick Start
The model was trained on approximately 170,634 sentences from the wikiner-fr dataset and validated on email and chat data, where it outperformed other models. In particular, it seems to work better on entities that do not start with an uppercase letter.
✨ Features
- Fine-tuned from camemBERT on the wikiner-fr dataset.
- Outperformed other models on email and chat validation data.
- Seems to work better on entities that do not start with an uppercase letter.
📦 Installation
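The original model card gives no installation steps. The usage examples below import the Hugging Face transformers library, so it must be available in your Python environment (for example, via pip install transformers).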
💻 Usage Examples
Basic Usage
Load camembert-ner and its sub-word tokenizer:
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Download (or load from the local cache) the tokenizer and the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")
Advanced Usage
Process a text sample (from Wikipedia):
from transformers import pipeline
# aggregation_strategy="simple" merges sub-word tokens into whole-entity spans
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")
[{'entity_group': 'ORG',
'score': 0.9472818374633789,
'word': 'Apple',
'start': 0,
'end': 5},
{'entity_group': 'PER',
'score': 0.9838564991950989,
'word': 'Steve Jobs',
'start': 74,
'end': 85},
{'entity_group': 'LOC',
'score': 0.9831605950991312,
'word': 'Los Altos',
'start': 87,
'end': 97},
{'entity_group': 'LOC',
'score': 0.9834540486335754,
'word': 'Californie',
'start': 100,
'end': 111},
{'entity_group': 'PER',
'score': 0.9841555754343668,
'word': 'Steve Jobs',
'start': 115,
'end': 126},
{'entity_group': 'PER',
'score': 0.9843501806259155,
'word': 'Steve Wozniak',
'start': 127,
'end': 141},
{'entity_group': 'PER',
'score': 0.9841533899307251,
'word': 'Ronald Wayne',
'start': 144,
'end': 157},
{'entity_group': 'ORG',
'score': 0.9468960364659628,
'word': 'Apple Computer',
'start': 243,
'end': 257}]
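The pipeline output above is a list of dictionaries, so it can be post-processed with plain Python. The following is a minimal sketch, not part of the original model card: `group_entities` and the `min_score` threshold are hypothetical names chosen for illustration. It filters out low-confidence predictions and groups the remaining entities by label. It reads the surface form from the input text using the `start` and `end` offsets and strips whitespace, because in the output above some spans (for example 'Steve Jobs' at 74-85) are one character longer than the word and include the preceding space.

```python
from collections import defaultdict


def group_entities(text, entities, min_score=0.9):
    """Group entity surface forms by label, keeping only confident predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        # Slice the original text with the character offsets; strip because
        # the span may include the space that precedes the entity.
        surface = text[ent["start"]:ent["end"]].strip()
        grouped[ent["entity_group"]].append(surface)
    return dict(grouped)


# Shortened copy of the sample sentence and four entries from the output above.
text = (
    "Apple est créée le 1er avril 1976 dans le garage de la maison "
    "d'enfance de Steve Jobs à Los Altos en Californie"
)
sample = [
    {"entity_group": "ORG", "score": 0.9472818374633789, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "PER", "score": 0.9838564991950989, "word": "Steve Jobs", "start": 74, "end": 85},
    {"entity_group": "LOC", "score": 0.9831605950991312, "word": "Los Altos", "start": 87, "end": 97},
    {"entity_group": "LOC", "score": 0.9834540486335754, "word": "Californie", "start": 100, "end": 111},
]

by_label = group_entities(text, sample, min_score=0.9)
print(by_label)
```

With the real pipeline, the same function could be called as `group_entities(text, nlp(text))`. The threshold value is arbitrary and should be tuned on your own data.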
📚 Documentation
Training data
Model type: NER model fine-tuned from camemBERT.

The training data was classified as follows:

| Abbreviation | Description |
|---|---|
| O | Outside of a named entity |
| MISC | Miscellaneous entity |
| PER | Person’s name |
| ORG | Organization |
| LOC | Location |
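For downstream code it can be convenient to keep the label set in a lookup table. The sketch below is illustrative only: `LABEL_DESCRIPTIONS` and `describe_label` are hypothetical names, and the handling of "B-"/"I-" prefixes is an assumption about how raw (non-aggregated) token labels may be written, not something stated in the model card.

```python
# Abbreviations and descriptions copied from the training-data classes.
LABEL_DESCRIPTIONS = {
    "O": "Outside of a named entity",
    "MISC": "Miscellaneous entity",
    "PER": "Person's name",
    "ORG": "Organization",
    "LOC": "Location",
}


def describe_label(label):
    """Return a human-readable description for an entity label."""
    # Assumption: raw token labels may carry an IOB-style prefix ("B-" or "I-").
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    return LABEL_DESCRIPTIONS.get(label, "Unknown label")


print(describe_label("PER"))
print(describe_label("I-LOC"))
```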
Model performance (metric: seqeval)
Overall
| Metric | Score |
|---|---|
| Precision | 0.8859 |
| Recall | 0.8971 |
| F1 | 0.8914 |
By entity
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| PER | 0.9372 | 0.9598 | 0.9483 |
| ORG | 0.8099 | 0.8265 | 0.8181 |
| LOC | 0.8905 | 0.9005 | 0.8955 |
| MISC | 0.8175 | 0.8117 | 0.8146 |
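As a quick sanity check on the figures above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes F1 from the reported precision and recall values; `f1_score` is a hypothetical helper written for this check, not a seqeval function. Because the published inputs are rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the performance figures above.
reported = {
    "overall": (0.8859, 0.8971, 0.8914),
    "PER": (0.9372, 0.9598, 0.9483),
    "ORG": (0.8099, 0.8265, 0.8181),
    "LOC": (0.8905, 0.9005, 0.8955),
    "MISC": (0.8175, 0.8117, 0.8146),
}

for name, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    # Inputs are rounded, so compare with a small tolerance rather than exactly.
    status = "ok" if abs(computed - f1) < 1e-3 else "mismatch"
    print(f"{name}: computed {computed:.4f}, reported {f1:.4f} ({status})")
```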
Related Article
For those who might be interested, here is a short article on how the author used the results of this model to train an LSTM model for signature detection in emails:
LSTM model for email signature detection
📄 License
This project is licensed under the MIT license.