HTML-Pruner-Phi-3.8B: Open-Source HTML Pruning Model for Modeling Retrieval Results in RAG Systems

HTML Pruner Phi 3.8B

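Developed by zstanjj

An HTML pruning model for HtmlRAG, which argues that HTML is better than plain text for modeling retrieved knowledge in RAG systems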

Large Language Model

Transformers

Language: English

License: Apache-2.0

Tags: #LosslessHTMLCleaning #RAGSystemOptimization #BlockTreePruning

Downloads: 319

Released: 10/16/2024

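Model Overview

This model works on retrieval results in HTML format. It combines lossless HTML cleaning with two-step block-tree-based HTML pruning to shorten the HTML that a RAG system passes to its language model.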

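Model Features

Lossless HTML cleaning

Removes only completely irrelevant content and compresses redundant structures, retaining all semantic information in the original HTML.

Two-step block-tree-based HTML pruning

The first step scores blocks with an embedding model; the second step uses a path generative model to prune further.

HTML format optimization

Tailored to HTML-format retrieval results in RAG systems, shortening them so they fit the downstream model's context window.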

Model Capabilities

HTML document cleaning

HTML content pruning

Semantic information preservation

RAG system optimization

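Use Cases

Information retrieval

Web page content condensing

Extracts key information from complex HTML pages and removes redundant content.

Result: more concise HTML that still preserves the semantics.

Knowledge formatting for RAG systems

Prepares external knowledge sources in HTML format for RAG systems.

Result: improved retrieval efficiency and accuracy in RAG systems.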

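🚀 HtmlRAG Model

We release an HTML pruner model for RAG systems. It processes retrieval results as HTML rather than plain text, which the HtmlRAG paper argues is the better format for modeling retrieved knowledge.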

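🚀 Quick Start

This project releases the HTML pruner model for HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems.

Useful links: 📄 Paper • 🤗 Hugging Face • 💻 GitHub

We propose HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To handle the long context that HTML brings, we propose lossless HTML cleaning and two-step block-tree-based HTML pruning.

Lossless HTML cleaning: This step removes only completely irrelevant content and compresses redundant structures, retaining all semantic information in the original HTML. The compressed HTML it produces suits RAG systems that have a long-context LLM and do not want to lose any information before generation.
Two-step block-tree-based HTML pruning: Both pruning steps operate on the block tree structure. The first step scores blocks with an embedding model and takes the losslessly cleaned HTML as input. The second step uses a path generative model and takes the output of the first step as input.

🌟 If you use this model, please star our GitHub repository to support us. Your star means a lot!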

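✨ Key Features

Uses HTML as the format of external knowledge in RAG systems.
Introduces lossless HTML cleaning and two-step block-tree-based HTML pruning to handle the long context that HTML brings.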

📦 Installation

Install the package with pip:

pip install htmlrag

Or install the package from source (run from the root of the cloned repository):

pip install -e .

💻 Usage Examples

Basic Usage

HTML Cleaning

from htmlrag import clean_html

question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<h1>Bellagio Hotel in Las</h1>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""

# Alternatively, you can read HTML files and merge them
# html_files=["/path/to/html/file1.html", "/path/to/html/file2.html"]
# htmls=[open(file).read() for file in html_files]
# html = "\n".join(htmls)

simplified_html = clean_html(html)
print(simplified_html)

# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
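To make the idea of lossless cleaning concrete, here is a minimal standard-library sketch of the kind of work such a step does: it drops comments, attributes, `<script>`/`<style>` content and empty elements, and unwraps a tag that only wraps another tag of the same name. It is not the implementation behind `clean_html` and will not reproduce its output exactly (for example, it keeps `<head>` and `<body>`). The names `sketch_clean_html`, `Node`, and `TreeBuilder` are hypothetical.

```python
import re
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"script", "style"}  # content with no value for retrieval
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances or text strings


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree; attributes and comments are discarded."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # simplification: void tags carry no text, so skip them
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Simplification: whitespace-only text is dropped, other runs collapsed.
        if data.strip():
            self.stack[-1].children.append(re.sub(r"\s+", " ", data))


def simplify(node):
    """Drop script/style and empty elements; unwrap <x><x>...</x></x>."""
    cleaned = []
    for child in node.children:
        if isinstance(child, str):
            cleaned.append(child)
        elif child.tag not in DROP_TAGS:
            simplify(child)
            if not child.children:
                continue  # empty element, nothing to keep
            only = child.children[0] if len(child.children) == 1 else None
            if isinstance(only, Node) and only.tag == child.tag:
                child = only  # redundant wrapper of the same tag
            cleaned.append(child)
    node.children = cleaned


def render(node):
    return "".join(
        escape(c, quote=False) if isinstance(c, str)
        else f"<{c.tag}>{render(c)}</{c.tag}>"
        for c in node.children
    )


def sketch_clean_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    simplify(builder.root)
    return render(builder.root)


print(sketch_clean_html('<div><div><p class="c">Hi</p></div></div><p></p><!-- x -->'))
# <div><p>Hi</p></div>
```

Every rule here removes markup only; visible text is never discarded, apart from the whitespace simplification noted in the comments.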

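Configuring Pruning Parameters

Real-world HTML documents can be much longer and more complex than this example. The parameters below can be configured; the small active values suit the toy example, and the commented-out values are the recommended settings for real-world documents: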

# Maximum number of words in a node when constructing the block tree for pruning with the embedding model
MAX_NODE_WORDS_EMBED = 10 
# MAX_NODE_WORDS_EMBED = 256 # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the embedding model
MAX_CONTEXT_WINDOW_EMBED = 60
# MAX_CONTEXT_WINDOW_EMBED = 6144 # a recommended setting for real-world HTML documents
# Maximum number of words in a node when constructing the block tree for pruning with the generative model
MAX_NODE_WORDS_GEN = 5
# MAX_NODE_WORDS_GEN = 128 # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the generative model
MAX_CONTEXT_WINDOW_GEN = 32
# MAX_CONTEXT_WINDOW_GEN = 4096 # a recommended setting for real-world HTML documents
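One way to keep the toy and real-world values from getting mixed up is to bundle them into named presets. The sketch below does this with a small dataclass; `PruneConfig`, `TOY`, and `REAL_WORLD` are hypothetical names and not part of `htmlrag`. The numeric values are the ones listed above. The validation rests on an assumption drawn from the pipeline order: the generative step prunes the output of the embedding step on a finer block tree, so its limits should be the smaller ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PruneConfig:
    max_node_words_embed: int
    max_context_window_embed: int
    max_node_words_gen: int
    max_context_window_gen: int

    def __post_init__(self):
        # Assumption of this sketch: step two refines step one, so its
        # block size and token budget should not exceed step one's.
        if self.max_context_window_gen > self.max_context_window_embed:
            raise ValueError("generative budget exceeds embedding budget")
        if self.max_node_words_gen > self.max_node_words_embed:
            raise ValueError("generative blocks are coarser than embedding blocks")


# Values for the toy example above.
TOY = PruneConfig(
    max_node_words_embed=10,
    max_context_window_embed=60,
    max_node_words_gen=5,
    max_context_window_gen=32,
)

# Values recommended above for real-world HTML documents.
REAL_WORLD = PruneConfig(
    max_node_words_embed=256,
    max_context_window_embed=6144,
    max_node_words_gen=128,
    max_context_window_gen=4096,
)
```

A preset can then be unpacked where the constants are used, for example `max_node_words=REAL_WORLD.max_node_words_embed`.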

Building the Block Tree

from htmlrag import build_block_tree

block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
# block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")

# Block Content:  <h1>Bellagio Hotel in Las</h1>
# Block Path:  ['html', 'title']
# Is Leaf:  True
# 
# Block Content:  <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path:  ['html', 'div']
# Is Leaf:  True
# 
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True

Pruning HTML Blocks with the Embedding Model

from htmlrag import EmbedHTMLPruner

embed_model="BAAI/bge-large-en"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval = query_instruction_for_retrieval)
# Alternatively, you can initialize a remote TEI model; see https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval = query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings=embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [2, 0, 1]

# Alternatively, you can use BM25 to rank the blocks
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings = bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [2, 0, 1]

from transformers import AutoTokenizer

chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_EMBED)
print(pruned_html)

# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
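The way a ranking and a token budget interact can be sketched as a greedy selection. This assumes the ranking lists block indices from most to least relevant, which is consistent with the outputs above (`[2, 0, 1]` keeps blocks 2 and 0 and drops block 1), and it counts whitespace-separated words in place of a real tokenizer. It is an illustration, not the logic inside `prune_HTML`; `select_blocks` is a hypothetical helper.

```python
def select_blocks(blocks, rankings, max_tokens, count_tokens=None):
    """Greedily keep the highest-ranked blocks that fit in max_tokens.

    blocks:   list of HTML strings in document order
    rankings: block indices, most relevant first (assumed convention)
    Returns the kept blocks in their original document order.
    """
    if count_tokens is None:
        # Stand-in for a tokenizer: count whitespace-separated words.
        count_tokens = lambda text: len(text.split())
    kept, used = set(), 0
    for idx in rankings:
        cost = count_tokens(blocks[idx])
        if used + cost > max_tokens:
            break  # budget exhausted; lower-ranked blocks are dropped
        kept.add(idx)
        used += cost
    return [blocks[i] for i in sorted(kept)]


blocks = [
    "<h1>Bellagio Hotel in Las</h1>",
    "<div><p>Some other text</p><p>Some other text</p></div>",
    "<p>The Bellagio is a luxury hotel and casino. It was built in 1998.</p>",
]
for block in select_blocks(blocks, rankings=[2, 0, 1], max_tokens=18):
    print(block)
```

Returning the survivors in document order rather than rank order keeps the pruned HTML readable as a page.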

Pruning HTML Blocks with the Generative Model

from htmlrag import GenHTMLPruner
import torch

# construct a finer block tree
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
# block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
    
# Block Content:  <h1>Bellagio Hotel in Las</h1>
# Block Path:  ['html', 'title']
# Is Leaf:  True
# 
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True

ckpt_path = "zstanjj/HTML-Pruner-Phi-3.8B"
device = "cuda" if torch.cuda.is_available() else "cpu"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, device=device)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html, block_tree)
print(block_rankings)

# [1, 0]

pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_GEN)
print(pruned_html)

# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>

📚 Documentation

Model Results

Results of HTML-Pruner-Phi-3.8B and HTML-Pruner-Llama-1B with Llama-3.1-70B-Instruct as the chat model.

Dataset	ASQA	HotpotQA	NQ	TriviaQA	MuSiQue	ELI5
Metric	EM	EM	EM	EM	EM	ROUGE-L
BM25	49.50	38.25	47.00	88.00	9.50	16.15
BGE	68.00	41.75	59.50	93.00	12.50	16.20
E5-Mistral	63.00	36.75	59.50	90.75	11.00	16.17
LongLLMLingua	62.50	45.00	56.75	92.50	10.25	15.84
JinaAI Reader	55.25	34.25	48.25	90.00	9.25	16.06
HtmlRAG-Phi-3.8B	68.50	46.25	60.50	93.50	13.25	16.33
HtmlRAG-Llama-1B	66.50	45.00	60.75	93.00	10.00	16.25
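
For a quick read of the table, the short script below finds the top-scoring method on each dataset. The numbers are copied from the rows above; `best_per_dataset` is a hypothetical helper written for this illustration.

```python
DATASETS = ["ASQA", "HotpotQA", "NQ", "TriviaQA", "MuSiQue", "ELI5"]

# Scores copied from the table above (EM, except ROUGE-L for ELI5).
RESULTS = {
    "BM25":             [49.50, 38.25, 47.00, 88.00, 9.50, 16.15],
    "BGE":              [68.00, 41.75, 59.50, 93.00, 12.50, 16.20],
    "E5-Mistral":       [63.00, 36.75, 59.50, 90.75, 11.00, 16.17],
    "LongLLMLingua":    [62.50, 45.00, 56.75, 92.50, 10.25, 15.84],
    "JinaAI Reader":    [55.25, 34.25, 48.25, 90.00, 9.25, 16.06],
    "HtmlRAG-Phi-3.8B": [68.50, 46.25, 60.50, 93.50, 13.25, 16.33],
    "HtmlRAG-Llama-1B": [66.50, 45.00, 60.75, 93.00, 10.00, 16.25],
}


def best_per_dataset(results, datasets):
    """Map each dataset name to (best method, best score)."""
    best = {}
    for col, name in enumerate(datasets):
        method = max(results, key=lambda m: results[m][col])
        best[name] = (method, results[method][col])
    return best


for name, (method, score) in best_per_dataset(RESULTS, DATASETS).items():
    print(f"{name}: {method} ({score:.2f})")
```

On these numbers HtmlRAG-Phi-3.8B leads on five of the six datasets, and HtmlRAG-Llama-1B leads on NQ.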

📄 License

This project is licensed under the Apache 2.0 license.

📚 Citation

@misc{tan2024htmlraghtmlbetterplain,
      title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems}, 
      author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},
      year={2024},
      eprint={2411.02959},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2411.02959}, 
}