🚀 DistilCamemBERT-QA
We present DistilCamemBERT-QA, a model fine-tuned from DistilCamemBERT for the French language question-answering task. This model is trained on two datasets, FQuAD v1.0 and Piaf, which consist of contexts and questions with answers within the contexts.
🚀 Quick Start
Prerequisites
- Python environment
- Required libraries: `transformers` (plus `optimum` if using ONNX)
Basic Usage
```python
from transformers import pipeline

qa_engine = pipeline(
    "question-answering",
    model="cmarkea/distilcamembert-base-qa",
    tokenizer="cmarkea/distilcamembert-base-qa"
)
result = qa_engine(
    context="David Fincher, né le 28 août 1962 à Denver (Colorado), "
            "est un réalisateur et producteur américain. Il est principalement "
            "connu pour avoir réalisé les films Seven, Fight Club, L'Étrange "
            "Histoire de Benjamin Button, The Social Network et Gone Girl qui "
            "lui ont valu diverses récompenses et nominations aux Oscars du "
            "cinéma ou aux Golden Globes. Réputé pour son perfectionnisme, il "
            "peut tourner un très grand nombre de prises de ses plans et "
            "séquences afin d'obtenir le rendu visuel qu'il désire. Il a "
            "également développé et produit les séries télévisées House of "
            "Cards (pour laquelle il remporte l'Emmy Award de la meilleure "
            "réalisation pour une série dramatique en 2013) et Mindhunter, "
            "diffusées sur Netflix.",
    question="Quel est le métier de David Fincher ?"
)
result
```

```python
{'score': 0.7981914281845093,
 'start': 61,
 'end': 98,
 'answer': ' réalisateur et producteur américain.'}
```
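The `start` and `end` fields are character offsets into the original context, so slicing the context string reproduces the answer span. A minimal standalone check (re-using the first sentence of the context and the output values above; note the raw span includes a leading space and trailing period):

```python
# Output fields from the QA pipeline, as shown above
result = {'score': 0.7981914281845093, 'start': 61, 'end': 98,
          'answer': ' réalisateur et producteur américain.'}

# First sentence of the context passed to the pipeline
context = (
    "David Fincher, né le 28 août 1962 à Denver (Colorado), "
    "est un réalisateur et producteur américain."
)

# 'start' and 'end' index directly into the context string
span = context[result['start']:result['end']]
assert span == result['answer']
print(repr(span))  # ' réalisateur et producteur américain.'
```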
Advanced Usage (Optimum + ONNX)
```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline

HUB_MODEL = "cmarkea/distilcamembert-base-qa"

tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
# Load the ONNX export of the model for faster CPU inference
model = ORTModelForQuestionAnswering.from_pretrained(HUB_MODEL)
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

# A quantized ONNX variant is also available on the Hub; it can be used
# in the same pipeline in place of the model above
quantized_model = ORTModelForQuestionAnswering.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)
```
✨ Features
- Efficient Inference: Thanks to DistilCamemBERT, inference time is halved at the same power consumption compared to CamemBERT-based models.
- Trained on Quality Datasets: Fine-tuned on the FQuAD v1.0 and Piaf datasets, ensuring high-quality question-answering performance.
📦 Installation
Install the libraries listed under Prerequisites from PyPI: `pip install transformers`, and `pip install optimum[onnxruntime]` for ONNX inference.
📚 Documentation
Dataset
The training data combines FQuAD v1.0 and Piaf, totalling 24,566 question-answer pairs in the training set and 3,188 in the evaluation set.
Evaluation results and benchmark
We compare DistilCamemBERT-QA with two other French language models: etalab-ia/camembert-base-squadFR-fquad-piaf based on CamemBERT and fmikaelian/flaubert-base-uncased-squad based on FlauBERT.
For benchmarks, we use three metrics: exact word-to-word comparison, F1-score (measuring the token-level overlap between the predicted answer and the ground truth), and inclusion score (measuring whether the ground-truth answer is contained in the predicted answer). Mean inference time is measured on an AMD Ryzen 5 4500U @ 2.3 GHz with 6 cores.
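The three metrics can be sketched as follows. This is a minimal illustration of the metric definitions, not the exact benchmark code; tokenization here is plain whitespace splitting:

```python
from collections import Counter

def exact_match(pred: str, truth: str) -> bool:
    # Word-to-word comparison: identical token sequences after whitespace split
    return pred.split() == truth.split()

def f1_score(pred: str, truth: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens
    pred_tokens, truth_tokens = pred.split(), truth.split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def inclusion(pred: str, truth: str) -> bool:
    # Inclusion score: the ground-truth answer appears inside the prediction
    return truth.strip() in pred

pred = "réalisateur et producteur américain"
truth = "réalisateur"
print(exact_match(pred, truth))              # False
print(round(f1_score(pred, truth), 2))       # 0.4
print(inclusion(pred, truth))                # True
```

Here the prediction contains the ground truth (inclusion passes) and shares one of its four tokens with it (precision 0.25, recall 1.0, F1 0.4), but fails the exact-match test.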
⚠️ Important Note
Do not rely on the results of the FlauBERT model: its scores are anomalously low, which suggests a problem with that modeling.
📄 License
The model is licensed under cc-by-nc-sa-3.0.
🔧 Technical Details
This model is fine-tuned from DistilCamemBERT for the French question-answering task. It addresses the scaling issue of CamemBERT-based models by reducing inference time while maintaining the same power consumption.
📖 Citation
```bibtex
@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}
```