Camembert-NER Open-Source Model: Named Entity Recognition for French Text

Camembert Ner

Developed by Jean-Baptiste

A Named Entity Recognition (NER) model for French text, fine-tuned from camemBERT on the wikiner-fr dataset.

Sequence Labeling

Transformers

French

Open Source License: MIT

#French Named Entity Recognition #Optimization for Non-Capitalized Entities #Fine-tuned on wikiner-fr

Downloads 230.81k

Release Time: 3/2/2022

Model Overview

This model performs named entity recognition on French text. It identifies person names (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).

Model Features

Efficient Recognition of Non-Capitalized Entities

When validated on email/chat data, the model appeared to work better than other models on entities that do not start with a capital letter.

Trained on High-Quality Dataset

Trained on the wikiner-fr dataset (approximately 170,634 sentences) and validated on email/chat data.

Model Capabilities

Recognize named entities in French texts

Classify entity types (PER, ORG, LOC, MISC)

Use Cases

Text Analysis

Wikipedia Text Analysis

Extract named entities from Wikipedia texts

High accuracy in identifying organizations, person names, and geographical locations

Email Signature Detection

Identify signature information in emails

The model's results can be used to train an LSTM model for signature detection

🚀 camembert-ner: A Fine-tuned Model from camemBERT for NER Task

camembert-ner is a Named Entity Recognition (NER) model fine-tuned from camemBERT on the wikiner-fr dataset, offering high performance in NER tasks, especially on emails/chat data.

🚀 Quick Start

camembert-ner was trained on approximately 170,634 sentences from the wikiner-fr dataset. It was validated on emails/chat data, where it outperformed other models. In particular, it seems to work better on entities that don't start with an uppercase letter.

✨ Features

Fine-tuned from camemBERT on the wikiner-fr dataset.
High performance on emails/chat data.
Works well on entities not starting with an uppercase letter.

📦 Installation

The original model card lists no installation steps. The usage examples below require the Hugging Face transformers library.
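As a minimal sketch, the following checks whether the packages the usage examples rely on can be imported. It assumes `transformers` (imported in the examples) and `torch` (needed by the PyTorch model classes in `transformers`); the helper name `missing_packages` is hypothetical and not part of any library.

```python
from importlib.util import find_spec


def missing_packages(names=("transformers", "torch")):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Assumption: the packages are installed from PyPI under the same names.
        print("Install first: pip install " + " ".join(missing))
    else:
        print("transformers and torch are available.")
```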

💻 Usage Examples

Basic Usage

Load camembert-ner and its sub-word tokenizer:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and the fine-tuned token-classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")

Advanced Usage

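Process a text sample (from Wikipedia):

from transformers import pipeline

# Reuses the model and tokenizer loaded in the basic example.
# aggregation_strategy="simple" groups tokens into whole entities instead of returning one result per sub-word token.
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")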
nlp("Apple est créée le 1er avril 1976 dans le garage de la maison d'enfance de Steve Jobs à Los Altos en Californie par Steve Jobs, Steve Wozniak et Ronald Wayne14, puis constituée sous forme de société le 3 janvier 1977 à l'origine sous le nom d'Apple Computer, mais pour ses 30 ans et pour refléter la diversification de ses produits, le mot « computer » est retiré le 9 janvier 2015.")

[{'entity_group': 'ORG',
  'score': 0.9472818374633789,
  'word': 'Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.9838564991950989,
  'word': 'Steve Jobs',
  'start': 74,
  'end': 85},
 {'entity_group': 'LOC',
  'score': 0.9831605950991312,
  'word': 'Los Altos',
  'start': 87,
  'end': 97},
 {'entity_group': 'LOC',
  'score': 0.9834540486335754,
  'word': 'Californie',
  'start': 100,
  'end': 111},
 {'entity_group': 'PER',
  'score': 0.9841555754343668,
  'word': 'Steve Jobs',
  'start': 115,
  'end': 126},
 {'entity_group': 'PER',
  'score': 0.9843501806259155,
  'word': 'Steve Wozniak',
  'start': 127,
  'end': 141},
 {'entity_group': 'PER',
  'score': 0.9841533899307251,
  'word': 'Ronald Wayne',
  'start': 144,
  'end': 157},
 {'entity_group': 'ORG',
  'score': 0.9468960364659628,
  'word': 'Apple Computer',
  'start': 243,
  'end': 257}]
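
The pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys, as shown above. A minimal sketch of post-processing that list follows. The helper name `group_entities` and the 0.95 threshold are illustrative assumptions, and the sample data is copied from three of the results above so the block runs without downloading the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group pipeline results by entity type, keeping those at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Three entries copied from the output above.
sample = [
    {"entity_group": "ORG", "score": 0.9472818374633789, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "PER", "score": 0.9838564991950989, "word": "Steve Jobs", "start": 74, "end": 85},
    {"entity_group": "LOC", "score": 0.9831605950991312, "word": "Los Altos", "start": 87, "end": 97},
]

print(group_entities(sample))
print(group_entities(sample, min_score=0.95))
```

With `min_score=0.95`, the `Apple` entry (score about 0.947) is dropped. In the output above, `start` and `end` are character offsets into the input string. The spans after the first word appear to include the preceding space (for example, 74 to 85 covers " Steve Jobs"), so stripping whitespace from `text[start:end]` is a reasonable precaution.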

📚 Documentation

Training data

Model type: NER model fine-tuned from camemBERT.

The training data was classified as follows:

Abbreviation	Description
O	Outside of a named entity
MISC	Miscellaneous entity
PER	Person's name
ORG	Organization
LOC	Location
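
To illustrate how per-token labels from this tag set relate to the entity groups returned by the pipeline, here is a minimal sketch that merges consecutive tokens sharing a label into one span. It is a simplified illustration, not the `transformers` implementation; the function name `merge_labeled_tokens` and the token/label pairs are made up for the example.

```python
from itertools import groupby

# Tag set as described in the training data section.
LABELS = {
    "O": "Outside of a named entity",
    "MISC": "Miscellaneous entity",
    "PER": "Person's name",
    "ORG": "Organization",
    "LOC": "Location",
}


def merge_labeled_tokens(tagged_tokens):
    """Merge runs of consecutive tokens with the same non-O label into (label, text) spans."""
    spans = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            spans.append((label, " ".join(token for token, _ in run)))
    return spans


# Hypothetical word-level labels, for illustration only.
tagged = [("Steve", "PER"), ("Jobs", "PER"), ("à", "O"), ("Los", "LOC"), ("Altos", "LOC")]
for label, text in merge_labeled_tokens(tagged):
    print(f"{label} ({LABELS[label]}): {text}")
```

Because the tag set as listed has no begin/inside markers, this sketch would merge two adjacent entities of the same type into a single span.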

Model performances (metric: seqeval)

Overall

Metric	Value
Precision	0.8859
Recall	0.8971
F1	0.8914

By entity

Entity	Precision	Recall	F1
PER	0.9372	0.9598	0.9483
ORG	0.8099	0.8265	0.8181
LOC	0.8905	0.9005	0.8955
MISC	0.8175	0.8117	0.8146
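
As a consistency check on the tables above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall. Small differences in the fourth decimal place are expected because the published inputs are already rounded. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the tables above.
reported = {
    "Overall": (0.8859, 0.8971, 0.8914),
    "PER": (0.9372, 0.9598, 0.9483),
    "ORG": (0.8099, 0.8265, 0.8181),
    "LOC": (0.8905, 0.9005, 0.8955),
    "MISC": (0.8175, 0.8117, 0.8146),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, reported F1 = {f1:.4f}")
```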

For those who might be interested, the author wrote a short article, "LSTM model for email signature detection", on how the results of this model were used to train an LSTM model for signature detection in emails.

📄 License

This project is licensed under the MIT license.
