🚀 DistilCamemBERT-NER
We present DistilCamemBERT-NER, a model based on DistilCamemBERT and fine-tuned for Named Entity Recognition (NER) in French. This work is inspired by Jean-Baptiste/camembert-ner, which is built on the CamemBERT model. The issue with CamemBERT-based models is scalability, especially in production, where inference cost can be a significant technological hurdle. To address this, DistilCamemBERT-NER halves inference time at the same power consumption, thanks to DistilCamemBERT.
✨ Features
- Fine-tuned DistilCamemBERT for French NER.
- Halves inference time compared to CamemBERT-based models at the same power consumption.
📦 Installation
The model runs with the Hugging Face `transformers` library (e.g. `pip install transformers`). The ONNX example below additionally requires `pip install optimum[onnxruntime]`.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

ner = pipeline(
    task='ner',
    model="cmarkea/distilcamembert-base-ner",
    tokenizer="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple"
)
result = ner(
    "Le Crédit Mutuel Arkéa est une banque Française, elle comprend le CMB "
    "qui est une banque située en Bretagne et le CMSO qui est une banque "
    "qui se situe principalement en Aquitaine. C'est sous la présidence de "
    "Louis Lichou, dans les années 1980 que différentes filiales sont créées "
    "au sein du CMB et forment les principales filiales du groupe qui "
    "existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.)."
)
result
```
```python
[{'entity_group': 'ORG',
  'score': 0.9974479,
  'word': 'Crédit Mutuel Arkéa',
  'start': 3,
  'end': 22},
 {'entity_group': 'LOC',
  'score': 0.9000358,
  'word': 'Française',
  'start': 38,
  'end': 47},
 {'entity_group': 'ORG',
  'score': 0.9788757,
  'word': 'CMB',
  'start': 66,
  'end': 69},
 {'entity_group': 'LOC',
  'score': 0.99919766,
  'word': 'Bretagne',
  'start': 99,
  'end': 107},
 {'entity_group': 'ORG',
  'score': 0.9594884,
  'word': 'CMSO',
  'start': 114,
  'end': 118},
 {'entity_group': 'LOC',
  'score': 0.99935514,
  'word': 'Aquitaine',
  'start': 169,
  'end': 178},
 {'entity_group': 'PER',
  'score': 0.99911094,
  'word': 'Louis Lichou',
  'start': 208,
  'end': 220},
 {'entity_group': 'ORG',
  'score': 0.96226394,
  'word': 'CMB',
  'start': 291,
  'end': 294},
 {'entity_group': 'ORG',
  'score': 0.9983959,
  'word': 'Federal Finance',
  'start': 374,
  'end': 389},
 {'entity_group': 'ORG',
  'score': 0.9984454,
  'word': 'Suravenir',
  'start': 391,
  'end': 400},
 {'entity_group': 'ORG',
  'score': 0.9985084,
  'word': 'Financo',
  'start': 402,
  'end': 409}]
```
Advanced Usage
```python
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

# Use the NER checkpoint (the original snippet mistakenly pointed at the NLI model).
HUB_MODEL = "cmarkea/distilcamembert-base-ner"
tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)

# Load the model with ONNX Runtime for faster CPU inference.
model = ORTModelForTokenClassification.from_pretrained(HUB_MODEL)
onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
                    aggregation_strategy="simple")

# Quantized version: smaller and faster, at a small cost in accuracy.
quantized_model = ORTModelForTokenClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
quantized_ner = pipeline("token-classification", model=quantized_model,
                         tokenizer=tokenizer, aggregation_strategy="simple")
```
📚 Documentation
Dataset
The dataset used is wikiner_fr, which contains approximately 170,000 sentences annotated with 5 categories:
- PER: person
- LOC: location
- ORG: organization
- MISC: miscellaneous entities (e.g., movie titles, books)
- O: background (outside any entity)
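Inside the pipeline, `aggregation_strategy="simple"` merges token-level predictions into entity spans. A minimal conceptual sketch of that grouping (the function name and IOB2-style tags here are illustrative, not the pipeline's internals):

```python
# Hypothetical illustration of how token-level IOB2 predictions are
# grouped into entity spans, similar in spirit to what
# aggregation_strategy="simple" does inside the pipeline.
def group_entities(tokens):
    """tokens: list of (word, tag) pairs with IOB2-style tags like 'B-PER'."""
    groups, current = [], None
    for word, tag in tokens:
        if tag == "O":
            # Background token: close any open entity.
            if current:
                groups.append(current)
                current = None
            continue
        prefix, label = tag.split("-", 1)
        if current and current["entity_group"] == label and prefix == "I":
            # Continuation of the current entity.
            current["word"] += " " + word
        else:
            # Start of a new entity.
            if current:
                groups.append(current)
            current = {"entity_group": label, "word": word}
    if current:
        groups.append(current)
    return groups

print(group_entities([
    ("Louis", "B-PER"), ("Lichou", "I-PER"),
    ("préside", "O"), ("le", "O"),
    ("CMB", "B-ORG"),
]))
```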
Evaluation Results
| class  | precision (%) | recall (%) | f1 (%) | support (# sub-words) |
|--------|---------------|------------|--------|-----------------------|
| global | 98.17         | 98.19      | 98.18  | 378,776               |
| PER    | 96.78         | 96.87      | 96.82  | 23,754                |
| LOC    | 94.05         | 93.59      | 93.82  | 27,196                |
| ORG    | 86.05         | 85.92      | 85.98  | 6,526                 |
| MISC   | 88.78         | 84.69      | 86.69  | 11,891                |
| O      | 99.26         | 99.47      | 99.37  | 309,409               |
Benchmark
This model's performance is compared to two reference models using the F1 score. Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores.
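The card does not detail the timing protocol; one reasonable way to measure mean inference time on CPU is to average wall-clock time over repeated calls (the helper name here is an assumption, not part of the benchmark):

```python
import time

def mean_inference_time(fn, n_runs=10):
    """Return the mean wall-clock time (seconds) of n_runs calls to fn()."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / n_runs

# Usage with the NER pipeline from the example above (uncomment to run):
# print(mean_inference_time(lambda: ner("Le Crédit Mutuel Arkéa est une banque.")))
```

In practice one would also add a few warm-up calls before timing, so that one-off costs (model loading, caches) do not inflate the mean.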
🔧 Technical Details
The model is based on DistilCamemBERT and fine-tuned for French NER. It addresses the scalability issue of CamemBERT-based models by halving inference time while keeping the same power consumption.
📄 License
The model is released under the MIT license.
📖 Citation
```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```