QAmembert: an open-source French question-answering model that handles questions with and without an answer in the context

QAmembert

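Developed by CATIE-AQ

QAmembert is a model fine-tuned from CamemBERT base for French question answering. It was trained on four French question-answering datasets and handles both questions whose answer is in the context and questions whose answer is not.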

Question Answering

Transformers

French · License: MIT · #FrenchQuestionAnswering #SQuADFormat #NoAnswerDetection

Downloads: 37

Released: 1/10/2023

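Model overview

The model is built specifically for French question answering. It handles both the case where the answer appears in the context and the case where it does not, which makes it applicable to a range of French question-answering scenarios.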

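Model features

Multi-dataset training

Trained on four French question-answering datasets, for a total of 221,348 context/question/answer triples covering both SQuAD 1.0 and SQuAD 2.0 formats.

No-answer support

Handles questions whose answer is not in the context. The model is trained and evaluated using the SQuAD 2.0 format.

Strong performance

On FQuAD 1.0 and qwant/squad_fr, exact match and F1 are on par with the comparable etalab-ia/camembert-base-squadFR-fquad-piaf model. On frenchQA, which includes unanswerable questions, the current version reaches 77.14 exact match and 86.88 F1.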

Model capabilities

French question answering

No-answer handling

Context understanding

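Use cases

Education

French learning support

Helps students learn French through question answering

Provides accurate answers grounded in the context

Information retrieval

Question answering over French documents

Quickly finds the answer to a specific question in a French document

Extracts the relevant information efficiently and accurately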

🚀 QAmembert

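QAmembert is a model fine-tuned from CamemBERT base for French question answering. It is trained on several French question-answering datasets and handles both questions that have an answer in the context and questions that do not.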

🚀 Quick start

Environment setup

Make sure the transformers library is installed. It can be installed with:

pip install transformers

Code example

The following example uses QAmembert to answer a question:

from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert', tokenizer='CATIE-AQ/QAmembert')

result = qa({
    'question': "Combien de personnes utilisent le français tous les jours ?",
    'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière.  Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
})

# A very low score means the model found no answer in the context.
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])

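✨ Key features

Multi-dataset training: fine-tuned on four French question-answering datasets, covering data in both SQuAD 1.0 and SQuAD 2.0 formats.
Both answer cases: handles questions whose answer is in the context as well as questions whose answer is not.
Evaluated on several benchmarks: results are reported on FQuAD 1.0, qwant/squad_fr, and frenchQA.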
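
To make the difference between the two formats concrete, here is a minimal sketch of an answerable and an unanswerable example in the SQuAD-style layout used by the Hugging Face `squad` and `squad_v2` datasets. Both records and the `has_answer` helper are hypothetical illustrations and are not taken from frenchQA.

```python
# Hypothetical records in SQuAD-style layout (illustration only, not from frenchQA).
context = "La tour Eiffel est située à Paris."

# SQuAD 1.0 style: every question has an answer span inside the context.
answerable = {
    "context": context,
    "question": "Où est située la tour Eiffel ?",
    "answers": {"text": ["Paris"], "answer_start": [context.index("Paris")]},
}

# SQuAD 2.0 style additionally allows questions with no answer: both lists are empty.
unanswerable = {
    "context": context,
    "question": "Quel est le meilleur vin du monde ?",
    "answers": {"text": [], "answer_start": []},
}


def has_answer(record: dict) -> bool:
    """Return True when the record carries at least one answer span."""
    return len(record["answers"]["text"]) > 0


print(has_answer(answerable), has_answer(unanswerable))
```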

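📦 Datasets

Dataset	Format	Train split	Dev split	Test split
piaf	SQuAD 1.0	9,224 Q&A pairs	X	X
piaf_v2	SQuAD 2.0	9,224 Q&A pairs	X	X
fquad	SQuAD 1.0	20,731 Q&A pairs	3,188 Q&A pairs (not used for training; held out as the test set)	2,189 Q&A pairs (not used in this work because not freely available)
fquad_v2	SQuAD 2.0	20,731 Q&A pairs	3,188 Q&A pairs (not used for training; held out as the test set)	X
lincoln/newsquadfr	SQuAD 1.0	1,650 Q&A pairs	455 Q&A pairs (not used in this work)	X
lincoln/newsquadfr_v2	SQuAD 2.0	1,650 Q&A pairs	455 Q&A pairs (not used in this work)	X
pragnakalp/squad_v2_french_translated	SQuAD 2.0	79,069 Q&A pairs	X	X
pragnakalp/squad_v2_french_translated_v2	SQuAD 2.0	79,069 Q&A pairs	X	X

All of these datasets were merged into a single dataset named frenchQA.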
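
The split sizes in the table above can be tallied to check the totals quoted elsewhere in this card (221,348 training triples and 6,376 test triples). The short sketch below only re-adds the numbers from the table; the variable names are illustrative.

```python
# Training-split sizes copied from the table above, one entry per source dataset.
train_sizes = {
    "piaf": 9_224,
    "fquad": 20_731,
    "lincoln/newsquadfr": 1_650,
    "pragnakalp/squad_v2_french_translated": 79_069,
}

# Each source is listed twice in the table: the original and its "_v2" variant.
total_train = 2 * sum(train_sizes.values())

# The fquad and fquad_v2 dev splits (3,188 each) serve as the test data.
total_test = 2 * 3_188

print(total_train, total_test)
```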

📚 Evaluation results

Evaluation was carried out with the evaluate Python package.
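
For readers unfamiliar with the two headline metrics, the following is a simplified pure-Python sketch of SQuAD-style exact match and token-level F1 for a single prediction. It is not the code behind the reported numbers: the official SQuAD metric also strips English articles and takes the best score over several reference answers, and the `normalize` helper here is an assumption made for illustration.

```python
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # An empty prediction only matches an empty reference (no-answer case).
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("235 millions", "235 Millions."))
print(f1_score("235 millions de personnes", "235 millions"))
```

Exact match rewards only a perfect span, while F1 gives partial credit for overlapping tokens, which is why F1 is consistently the higher of the two figures in the tables that follow.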

FQuAD 1.0 (validation set)

The metric used is SQuAD 1.0.

Model	Exact match	F1 score
etalab-ia/camembert-base-squadFR-fquad-piaf	53.60	78.09
QAmembert (previous version)	54.26	77.87
QAmembert (current version)	53.98	78.00
QAmembert-large	55.95	81.05

qwant/squad_fr (validation set)

The metric used is SQuAD 1.0.

Model	Exact match	F1 score
etalab-ia/camembert-base-squadFR-fquad-piaf	60.17	78.27
QAmembert (previous version)	60.40	77.27
QAmembert (current version)	60.95	77.30
QAmembert-large	65.58	81.74

frenchQA

This dataset contains questions with no answer in the context. The metric used is SQuAD 2.0.

Model	Exact match	F1 score	Answer F1	No-answer F1
etalab-ia/camembert-base-squadFR-fquad-piaf	n/a	n/a	n/a	n/a
QAmembert (previous version)	60.28	71.29	75.92	66.65
QAmembert (current version)	77.14	86.88	75.66	98.11
QAmembert-large	77.14	88.74	78.83	98.65
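
The overall F1 in this table can be read as a weighted average of the answer and no-answer F1 columns. The sketch below assumes the test set is split evenly between answerable and unanswerable questions (3,188 each, as the dataset table suggests); that 50/50 share is an assumption, and `overall_f1` is an illustrative helper rather than part of any library.

```python
def overall_f1(answer_f1: float, no_answer_f1: float, answer_share: float = 0.5) -> float:
    """Weighted average of the two per-subset F1 scores."""
    return answer_share * answer_f1 + (1 - answer_share) * no_answer_f1


# (answer F1, no-answer F1, reported overall F1) copied from the table above.
rows = {
    "QAmembert (previous version)": (75.92, 66.65, 71.29),
    "QAmembert (current version)": (75.66, 98.11, 86.88),
    "QAmembert-large": (78.83, 98.65, 88.74),
}

for name, (answer_f1, no_answer_f1, reported) in rows.items():
    print(f"{name}: computed {overall_f1(answer_f1, no_answer_f1):.3f}, reported {reported}")
```

Under that assumption the computed values agree with the reported column to within rounding, which shows that most of the gain of the current version over the previous one comes from the no-answer subset.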

💻 Usage examples

Basic usage

Example where the answer is in the context:

from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert', tokenizer='CATIE-AQ/QAmembert')

result = qa({
    'question': "Combien de personnes utilisent le français tous les jours ?",
    'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière.  Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
})

# A very low score means the model found no answer in the context.
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])

235 millions

# Full result
result
{'score': 0.9945194721221924,
 'start': 269,
 'end': 281,
 'answer': '235 millions'}

Advanced usage

Example where the answer is not in the context:

from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert', tokenizer='CATIE-AQ/QAmembert')

result = qa({
    'question': "Quel est le meilleur vin du monde ?",
    'context': (
        "La tour Eiffel est une tour de fer puddlé de 330 m de hauteur (avec antennes) située à Paris, à l’extrémité nord-ouest du parc du Champ-de-Mars en bordure de la Seine dans le 7e arrondissement. Son adresse officielle est 5, avenue Anatole-France.  \n"
        "Construite en deux ans par Gustave Eiffel et ses collaborateurs pour l'Exposition universelle de Paris de 1889, célébrant le centenaire de la Révolution française, et initialement nommée « tour de 300 mètres », elle est devenue le symbole de la capitale française et un site touristique de premier plan : il s’agit du quatrième site culturel français payant le plus visité en 2016, avec 5,9 millions de visiteurs. Depuis son ouverture au public, elle a accueilli plus de 300 millions de visiteurs."
    )
})

# A very low score means the model found no answer in the context.
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])

La réponse n'est pas dans le contexte fourni.

# Full result
result
{'score': 3.619904940035945e-13,
 'start': 734,
 'end': 744,
 'answer': 'visiteurs.'}
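
Both examples apply the same score cutoff, so the check can be factored into a small helper. The sketch below works on the result dictionaries shown above and does not need the model to run; `extract_answer` and `NO_ANSWER_THRESHOLD` are illustrative names, and 0.01 is simply the cutoff used in the examples, not a tuned value.

```python
from typing import Optional

NO_ANSWER_THRESHOLD = 0.01  # same cutoff as in the examples above


def extract_answer(result: dict, threshold: float = NO_ANSWER_THRESHOLD) -> Optional[str]:
    """Return the predicted span, or None when the score falls below the cutoff."""
    if result["score"] < threshold:
        return None
    return result["answer"]


# Result dictionaries copied from the two examples above.
answered = {"score": 0.9945194721221924, "start": 269, "end": 281, "answer": "235 millions"}
unanswered = {"score": 3.619904940035945e-13, "start": 734, "end": 744, "answer": "visiteurs."}

print(extract_answer(answered))    # 235 millions
print(extract_answer(unanswered))  # None
```

Returning None instead of printing a fixed message lets the caller decide how to report a missing answer.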

Testing through a Space

The model can also be tried out in a Space.

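🔧 Technical details

The model is fine-tuned from CamemBERT base on four French question-answering datasets. All datasets were merged into a single dataset named frenchQA, giving 221,348 context/question/answer triples for fine-tuning and 6,376 for testing. The methodology is described in the accompanying blog post, available in English and in French.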

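🌱 Environmental impact

Carbon emissions were estimated with the Machine Learning Impact calculator presented in Lacoste et al. (2019). The estimate is based on the hardware, runtime, cloud provider, and compute region.

Hardware type: A100 PCIe 40/80GB
Hours used: 5 h 36 min
Cloud provider: private infrastructure
Carbon efficiency (kg/kWh): 0.076 kg (estimated from electricitymaps; because data for the training day was not available, the average carbon intensity in France for March 2023 was used)
Carbon emitted (power consumption x time x carbon produced based on the location of the power grid): 0.1 kg CO2 eq.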
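
The emitted-carbon figure follows from the formula in the last line. The sketch below reproduces it under one assumption that the card does not state: a power draw of 250 W for a single A100 PCIe card. The runtime and carbon intensity are taken from the list above.

```python
# ASSUMPTION: 250 W power draw for one A100 PCIe card (not stated in the card).
power_kw = 0.25

hours = 5 + 36 / 60        # 5 h 36 min of training
carbon_intensity = 0.076   # kg CO2 eq. per kWh, from the list above

# power consumption x time x carbon intensity of the power grid
emissions_kg = power_kw * hours * carbon_intensity

print(f"{emissions_kg:.2f} kg CO2 eq.")
```

With these inputs the estimate rounds to the reported 0.1 kg CO2 eq.; a different power draw would shift the result proportionally.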

📚 Citation

QAmemBERT

@misc {qamembert2023,  
    author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },  
    organization  = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { QAmembert (Revision 9685bc3) },  
    year         = 2023,  
    url          = { https://huggingface.co/CATIE-AQ/QAmembert},  
    doi          = { 10.57967/hf/0821 },  
    publisher    = { Hugging Face }  
}

PIAF

@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and
               Guillaume Lancrenon and
               Mathilde Bras and
               Fr{\'{e}}d{\'{e}}ric Allary and
               Gilles Moyse and
               Thomas Scialom and
               Edmundo{-}Pavel Soriano{-}Morales and
               Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}

FQuAD

@article{dHoffschmidt2020FQuADFQ,
  title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.06071}
}

lincoln/newsquadfr

Hugging Face repository: https://hf.co/datasets/lincoln/newsquadfr

pragnakalp/squad_v2_french_translated

Hugging Face repository: https://hf.co/datasets/pragnakalp/squad_v2_french_translated

CamemBERT

@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}