Provence-reranker-debertav3-v1 open-source model: a lightweight tool for retrieval-augmented generation in question-answering scenarios

Provence Reranker Debertav3 V1

Developed by naver

Provence is a lightweight context pruning model optimized for retrieval-augmented generation, particularly for question answering.

Large language model

Safetensors

English #QA optimization #context pruning #retrieval augmentation

Downloads: 1,506

Released: 12/11/2024

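Model overview

Provence removes sentences from a passage that are irrelevant to the user's question. It works with any large language model (LLM), speeds up generation, and reduces context noise.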

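Model features

Context pruning

Automatically detects and removes sentences in a passage that are irrelevant to the user's question, reducing context noise.

Multi-domain applicability

Trained on a diverse mix of MS Marco and Natural Questions data, so it applies across a range of domains.

Plug and play

Works with any large language model (LLM) without additional tuning.

Coreference capture

Encodes all sentences of a passage together, which lets it capture coreference between sentences and prune context more accurately.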

Model capabilities

Text reranking

Context pruning

Question-answering optimization

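Use cases

Question-answering systems

Wikipedia question answering

Prunes irrelevant sentences from Wikipedia articles to improve answer accuracy.

Reduces context noise and speeds up generation.

Retrieval-augmented generation

LLM context optimization

Supplies a large language model (LLM) with pruned context, reducing interference from irrelevant information.

Improves generation efficiency and quality.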

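🚀 Provence-reranker model card

Provence is a lightweight context pruning model designed for retrieval-augmented generation and optimized in particular for question answering. Given a user question and a retrieved passage, Provence removes the sentences of the passage that are irrelevant to the question. It can be used plug-and-play with any large language model (LLM), which both speeds up generation and reduces context noise.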

Paper: https://arxiv.org/abs/2501.16214 (accepted at ICLR 2025)
Blog post: https://huggingface.co/blog/nadiinchi/provence
Developed by: Naver Labs Europe
License: CC BY-NC 4.0
Model: provence-reranker-debertav3-v1 (the Provence model for pruning and reranking retrieved contexts)
Backbone model: DeBERTav3-reranker (trained from DeBERTa-v3-large)
Model size: 430 million parameters
Context length: 512 tokens

🚀 Quick start

Install dependencies

Provence uses nltk, which you can install with the following commands:

pip install nltk
python -c "import nltk; nltk.download('punkt_tab')"

Pruning a single context for a single question

from transformers import AutoModel

provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)

context = "Shepherd’s pie. History. In early cookery books, the dish was a means of using leftover roasted meat of any kind, and the pie dish was lined on the sides and bottom with mashed potato, as well as having a mashed potato crust on top. Variations and similar dishes. Other potato-topped pies include: The modern ”Cumberland pie” is a version with either beef or lamb and a layer of bread- crumbs and cheese on top. In medieval times, and modern-day Cumbria, the pastry crust had a filling of meat with fruits and spices.. In Quebec, a varia- tion on the cottage pie is called ”Paˆte ́ chinois”. It is made with ground beef on the bottom layer, canned corn in the middle, and mashed potato on top.. The ”shepherdess pie” is a vegetarian version made without meat, or a vegan version made without meat and dairy.. In the Netherlands, a very similar dish called ”philosopher’s stew” () often adds ingredients like beans, apples, prunes, or apple sauce.. In Brazil, a dish called in refers to the fact that a manioc puree hides a layer of sun-dried meat."
question = 'What goes on the bottom of Shepherd’s pie?'

provence_output = provence.process(question, context)
# print(f"Provence Output: {provence_output}")
# Provence Output: {'reranking_score': 3.022725, 'pruned_context': 'In early cookery books, the dish was a means of using leftover roasted meat of any kind, and the pie dish was lined on the sides and bottom with mashed potato, as well as having a mashed potato crust on top.'}

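Batch processing

You can also pass a list of questions together with a list of contexts (several contexts per question) to process them as a batch.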
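Before a batched call it can help to confirm that the inputs have the expected shape. The following is a minimal sketch, assuming the batched form is a list of questions plus one list of contexts per question; `check_batch_inputs` is a hypothetical helper written for this illustration and is not part of Provence.

```python
from typing import Sequence


def check_batch_inputs(questions: Sequence[str],
                       contexts: Sequence[Sequence[str]]) -> int:
    """Validate batched inputs; return the number of (question, context) pairs.

    Hypothetical helper for illustration only, not part of the Provence API.
    """
    if len(questions) != len(contexts):
        raise ValueError(
            f"got {len(questions)} questions but {len(contexts)} context lists"
        )
    for i, ctxs in enumerate(contexts):
        # A bare string here usually means the per-question list was forgotten.
        if isinstance(ctxs, str):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    return sum(len(ctxs) for ctxs in contexts)


questions = ["What goes on the bottom of Shepherd's pie?", "Another question"]
contexts = [
    ["First passage for question 1", "Second passage for question 1"],
    ["Only passage for question 2"],
]
n_pairs = check_batch_inputs(questions, contexts)
print(n_pairs)
```

Once the check passes, the same two lists can be handed to `provence.process(questions, contexts)`.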

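Keeping the title

Setting always_select_title=True keeps the first sentence, "Shepherd’s pie". This is especially useful for Wikipedia articles, where the title is often needed to understand the context. How the title is defined is described in more detail in the model interface section: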

provence_output = provence.process(question, context, always_select_title=True)
# print(f"Provence Output: {provence_output}")
# Provence Output: {'reranking_score': 3.022725, 'pruned_context': 'Shepherd’s pie. In early cookery books, the dish was a means of using leftover roasted meat of any kind, and the pie dish was lined on the sides and bottom with mashed potato, as well as having a mashed potato crust on top.'}

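✨ Key features

Joint sentence encoding: Provence encodes all sentences of a passage together, which lets it capture coreference between sentences and prune the context more accurately.
Automatic choice of the number of sentences to keep: Provence decides how many sentences to keep based on a threshold. We found that the default threshold works well across domains, but it can be tuned further for a specific use case.
Robustness across domains: Provence is trained on a diverse combination of MS Marco and Natural Questions data and is therefore robust across domains.
Compatible with any LLM: Provence can be used directly with any large language model.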
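To make the threshold-based selection concrete, here is a minimal sketch of how a threshold turns per-sentence scores into a pruned passage. It is an illustration only: the scores are made up, `select_sentences` is a hypothetical helper, and Provence computes and applies its own scores internally.

```python
from typing import List


def select_sentences(sentences: List[str],
                     scores: List[float],
                     threshold: float = 0.1,
                     always_select_title: bool = False) -> List[str]:
    """Keep the sentences whose score exceeds `threshold` (illustrative only)."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")

    keep = [score > threshold for score in scores]

    # The first sentence (the title) is added only when the selection
    # is already non-empty, following the documented parameter description.
    if always_select_title and any(keep):
        keep[0] = True

    return [s for s, k in zip(sentences, keep) if k]


sentences = [
    "Shepherd's pie.",
    "History.",
    "The pie dish was lined with mashed potato.",
    "Variations and similar dishes.",
]
scores = [0.05, 0.02, 0.93, 0.30]  # made-up relevance scores in [0, 1]

conservative = select_sentences(sentences, scores, threshold=0.1)
aggressive = select_sentences(sentences, scores, threshold=0.5)
with_title = select_sentences(sentences, scores, threshold=0.5,
                              always_select_title=True)
print(conservative, aggressive, with_title, sep="\n")
```

A lower threshold keeps more sentences (conservative pruning), while a higher one compresses the passage more aggressively.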

💻 Usage examples

Basic usage

from transformers import AutoModel

provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)

context = "Shepherd’s pie. History. In early cookery books, the dish was a means of using leftover roasted meat of any kind, and the pie dish was lined on the sides and bottom with mashed potato, as well as having a mashed potato crust on top. Variations and similar dishes. Other potato-topped pies include: The modern ”Cumberland pie” is a version with either beef or lamb and a layer of bread- crumbs and cheese on top. In medieval times, and modern-day Cumbria, the pastry crust had a filling of meat with fruits and spices.. In Quebec, a varia- tion on the cottage pie is called ”Paˆte ́ chinois”. It is made with ground beef on the bottom layer, canned corn in the middle, and mashed potato on top.. The ”shepherdess pie” is a vegetarian version made without meat, or a vegan version made without meat and dairy.. In the Netherlands, a very similar dish called ”philosopher’s stew” () often adds ingredients like beans, apples, prunes, or apple sauce.. In Brazil, a dish called in refers to the fact that a manioc puree hides a layer of sun-dried meat."
question = 'What goes on the bottom of Shepherd’s pie?'

provence_output = provence.process(question, context)

Advanced usage

# Batch processing: pass a list of questions and, for each question, a list of contexts.
# `question` and `context` are the variables defined in the basic usage example above.
questions = [question, 'Another question']
contexts = [[context], ['Another context']]

provence_output = provence.process(questions, contexts)

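📚 Detailed documentation

Model interface

The interface of the process function is as follows:

Parameter	Type	Details
`question`	`Union[List[str], str]`	The input question (a single string) or a list of questions (for batch processing).
`context`	`Union[List[List[str]], str]`	The context(s) to prune. Either a single string (for a single question) or a list of lists of contexts (several contexts per question); the length of `contexts` must equal the length of `questions`.
`title`	`Optional[Union[List[List[str]], str]]`, default: "first_sentence"	Optional argument defining the titles. If `title="first_sentence"`, the first sentence of each context is assumed to be its title. If `title=None`, no titles are assumed to be provided. Titles can also be passed as a list of lists of strings with the same shape as the contexts. Titles are used only when `always_select_title=True`.
`threshold`	`float`, in [0, 1], default: 0.1	Threshold for context pruning. We recommend 0.1 for more conservative pruning (no or minimal performance drop) and 0.5 for higher compression; the value can be tuned further to suit a specific use case.
`always_select_title`	`bool`, default: True	If True, the first sentence (the title) is included whenever the model selects a non-empty set of sentences. This matters for passages such as Wikipedia paragraphs, where the title gives the following sentences their proper context.
`batch_size`	`int`, default: 32	Batch size.
`reorder`	`bool`, default: False	If True, the contexts for each question are reordered by their computed question-passage relevance scores. If False, the original order of the user-provided contexts is kept.
`top_k`	`int`, default: 5	If `reorder=True`, the number of top-ranked passages to keep for each question.
`enable_warnings`	`bool`, default: True	Whether to print warnings about model usage, for example when a context or question is too long.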
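The effect of `reorder` and `top_k` can be pictured with a small, self-contained sketch. It assumes each passage is represented by a dict with the `reranking_score` and `pruned_context` keys shown in the single-context output above; `rerank_top_k` is a hypothetical helper, the scores are made up, and the structure of Provence's batched return value is not assumed here.

```python
from typing import Dict, List, Union

Passage = Dict[str, Union[str, float]]


def rerank_top_k(passages: List[Passage], top_k: int = 5) -> List[Passage]:
    """Sort passages by descending relevance score and keep the best `top_k`.

    Illustrative only; Provence applies reordering internally when
    `reorder=True` is passed to `process`.
    """
    ranked = sorted(passages, key=lambda p: p["reranking_score"], reverse=True)
    return ranked[:top_k]


passages = [
    {"reranking_score": 0.4, "pruned_context": "Passage A"},
    {"reranking_score": 3.0, "pruned_context": "Passage B"},
    {"reranking_score": -1.2, "pruned_context": "Passage C"},
]  # made-up scores for illustration

best_two = rerank_top_k(passages, top_k=2)
print([p["pruned_context"] for p in best_two])
```

With `reorder=False` the passages would simply stay in the order in which they were supplied.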

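Model details

Input: a user question (e.g., a sentence) + a retrieved context passage (e.g., a paragraph)
Output: the pruned context passage (i.e., with irrelevant sentences removed) + a relevance score (which can be used for reranking)
Model architecture: the model is initialized from DeBERTav3-reranker and fine-tuned with two objectives: (1) output a binary mask that can be used to prune irrelevant sentences; (2) preserve the initial reranking capability.
Training data: MS Marco (document) + NQ training sets, with synthetic silver labels for which sentences to keep, produced by Llama-3-8B.
Supported languages: English
Context length: 512 tokens (similar to the pretrained DeBERTa model)
Evaluation: we evaluate Provence on 7 datasets from different domains, including Wikipedia, biomedical data, course syllabi, and news. The evaluation was carried out on the model trained only on MS Marco data. We find that Provence prunes irrelevant sentences in all domains with little or no drop in performance, and that it outperforms existing baselines on the Pareto front (the top right corner of the plot).

See the paper for further analysis.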
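Since the pruned context is meant to be handed to a downstream LLM, the following sketch shows one way to assemble a prompt from pruned passages. The prompt wording and the `build_prompt` helper are assumptions made for this illustration; the source does not prescribe a prompt format.

```python
from typing import List


def build_prompt(question: str, pruned_contexts: List[str]) -> str:
    """Assemble an LLM prompt from pruned passages (hypothetical format)."""
    # Skip passages that were pruned down to nothing.
    kept = [c.strip() for c in pruned_contexts if c and c.strip()]
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(kept, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What goes on the bottom of Shepherd's pie?",
    [
        "The pie dish was lined on the sides and bottom with mashed potato.",
        "",  # a passage from which every sentence was pruned
        "It is made with ground beef on the bottom layer.",
    ],
)
print(prompt)
```

Dropping empty passages keeps the prompt short, which is where the generation speed-up from pruning comes from.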

📄 License

This work is licensed under CC BY-NC 4.0.

🔗 Citation

@misc{chirkova2025provenceefficientrobustcontext,
      title={Provence: efficient and robust context pruning for retrieval-augmented generation}, 
      author={Nadezhda Chirkova and Thibault Formal and Vassilina Nikoulina and Stéphane Clinchant},
      year={2025},
      eprint={2501.16214},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.16214}, 
}