HTML Pruner Phi 3.8B
An HTML pruning model designed for RAG systems where HTML is more suitable than plain text for modeling retrieval results
Downloads: 319
Release date: 10/16/2024
Model Overview
This model specializes in processing HTML-formatted retrieval results, optimizing knowledge retrieval efficiency in RAG systems through lossless HTML cleaning and two-step HTML pruning based on block trees.
Model Features
Lossless HTML Cleaning
Only removes completely irrelevant content and compresses redundant structures while preserving all semantic information in the original HTML.
Two-Step HTML Pruning Based on Block Trees
In the first step, an embedding model scores the blocks; in the second step, a generative model produces block paths to complete the pruning efficiently.
HTML Format Optimization
Specifically optimizes HTML-formatted retrieval results for RAG systems to improve knowledge retrieval efficiency.
Model Capabilities
HTML Document Cleaning
HTML Content Pruning
Semantic Information Preservation
RAG System Optimization
Use Cases
Information Retrieval
Web Content Simplification
Extracts key information from complex HTML web pages while removing redundant content, yielding more concise HTML that preserves the original semantics.
RAG System Knowledge Formatting
Prepares HTML-formatted external knowledge sources for RAG systems, improving their retrieval efficiency and accuracy.
HTML Pruner Model
This project releases the HTML pruner model used in HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems. The paper proposes HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To handle the long context brought by HTML, it also proposes lossless HTML cleaning and a two-step, block-tree-based HTML pruning method.
✨ Features
- Lossless HTML Cleaning: Removes completely irrelevant content and compresses redundant structures while retaining all semantic information in the original HTML. Suitable for RAG systems with long-context LLMs that do not want to lose any information before generation.
- Two-Step Block-Tree-Based HTML Pruning: Consists of two steps conducted on the block tree structure. The first step uses an embedding model to score blocks, and the second step uses a generative model that produces block paths to finish the pruning (see the end-to-end sketch below).
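The two steps map directly onto the htmlrag API demonstrated in the usage examples that follow. The condensed sketch below strings them together end to end; it assumes the recommended real-world settings from the configuration section (max_node_words of 256/128, context windows of 6144/4096 tokens), the file path is a placeholder, and whether every keyword argument shown is strictly required is an assumption rather than documented behavior.
# Condensed end-to-end sketch of the HtmlRAG pruning pipeline (settings are illustrative)
from htmlrag import clean_html, build_block_tree, EmbedHTMLPruner, GenHTMLPruner
from transformers import AutoTokenizer
question = "When was the bellagio in las vegas built?"
html = open("page.html").read()  # raw retrieved HTML (placeholder path)
simplified_html = clean_html(html)  # lossless cleaning
# step 1: coarse pruning with an embedding model
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=256)
query_instruction = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_pruner = EmbedHTMLPruner(embed_model="BAAI/bge-large-en", local_inference=True, query_instruction_for_retrieval=query_instruction)
rankings = embed_pruner.calculate_block_rankings(question, simplified_html, block_tree)
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
pruned_html = embed_pruner.prune_HTML(simplified_html, block_tree, rankings, chat_tokenizer, 6144)
# step 2: fine-grained pruning with the generative pruner model
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=128)
gen_pruner = GenHTMLPruner(gen_model="zstanjj/HTML-Pruner-Phi-3.8B", device="cuda")  # or device="cpu"
rankings = gen_pruner.calculate_block_rankings(question, pruned_html, block_tree)
pruned_html = gen_pruner.prune_HTML(pruned_html, block_tree, rankings, chat_tokenizer, 4096)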
📦 Installation
Install the package using pip:
pip install htmlrag
Or install the package from source:
pip install -e .
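To check that the installation worked, a quick smoke test that imports the entry points used in the examples below is enough (this is just a suggestion, not part of the official instructions; the expected output of clean_html here is not asserted):
# optional smoke test for the htmlrag package
from htmlrag import clean_html, build_block_tree, EmbedHTMLPruner, GenHTMLPruner
print(clean_html("<html><body><p>hello</p></body></html>"))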
💻 Usage Examples
Basic Usage
HTML Cleaning
from htmlrag import clean_html
question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<h1>Bellagio Hotel in Las</h1>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""
# alternatively, you can read HTML files and merge them
# html_files=["/path/to/html/file1.html", "/path/to/html/file2.html"]
# htmls=[open(file).read() for file in html_files]
# html = "\n".join(htmls)
simplified_html = clean_html(html)
print(simplified_html)
# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
Configure Pruning Parameters
# Maximum number of words in a node when constructing the block tree for pruning with the embedding model
MAX_NODE_WORDS_EMBED = 10
# MAX_NODE_WORDS_EMBED = 256 # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the embedding model
MAX_CONTEXT_WINDOW_EMBED = 60
# MAX_CONTEXT_WINDOW_EMBED = 6144 # a recommended setting for real-world HTML documents
# Maximum number of words in a node when constructing the block tree for pruning with the generative model
MAX_NODE_WORDS_GEN = 5
# MAX_NODE_WORDS_GEN = 128 # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the generative model
MAX_CONTEXT_WINDOW_GEN = 32
# MAX_CONTEXT_WINDOW_GEN = 4096 # a recommended setting for real-world HTML documents
Build Block Tree
from htmlrag import build_block_tree
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
# block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content: <h1>Bellagio Hotel in Las</h1>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path: ['html', 'div']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
Prune HTML Blocks with Embedding Model
from htmlrag import EmbedHTMLPruner
embed_model="BAAI/bge-large-en"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval = query_instruction_for_retrieval)
# alternatively you can init a remote TEI model, refer to https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval = query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings=embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [2, 0, 1]
# alternatively, you can use BM25 to rank the blocks
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings = bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [2, 0, 1]
from transformers import AutoTokenizer
# tokenizer of the downstream chat model, used to enforce the token budget of the pruned HTML
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_EMBED)
print(pruned_html)
# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
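If you want to verify that the embedding-stage output actually fits the configured budget, you can count tokens with the same chat tokenizer (a small sanity check, not part of the library API):
# check the pruned HTML against the embedding-stage token budget
n_tokens = len(chat_tokenizer.encode(pruned_html))
print(f"pruned HTML uses {n_tokens} tokens (budget: {MAX_CONTEXT_WINDOW_EMBED})")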
Prune HTML Blocks with Generative Model
from htmlrag import GenHTMLPruner
import torch
# construct a finer block tree
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
# block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content: <h1>Bellagio Hotel in Las</h1>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
ckpt_path = "zstanjj/HTML-Pruner-Phi-3.8B"
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, device=device)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html, block_tree)
print(block_rankings)
# [1, 0]
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_GEN)
print(pruned_html)
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
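The pruned HTML is intended to be passed to the chat model as external knowledge. A minimal sketch of that final step is shown below; the prompt wording and message layout are illustrative assumptions, not the template used in the paper.
# assemble a chat prompt that uses the pruned HTML as context (illustrative template)
messages = [
    {"role": "user",
     "content": f"Answer the question based on the following HTML.\n\n{pruned_html}\n\nQuestion: {question}"},
]
prompt = chat_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# the resulting prompt can then be sent to the chat model (e.g. Llama-3.1-70B-Instruct) for generation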
📚 Documentation
Model Information
We release the HTML pruner model used in HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems.
Useful links: the paper is available on arXiv at https://arxiv.org/abs/2411.02959.
If you use this model, please star our GitHub repository to support us. Your star means a lot!
Results
- Results for HTML-Pruner-Phi-3.8B and HTML-Pruner-Llama-1B with Llama-3.1-70B-Instruct as the chat model.
| Dataset | ASQA | HotpotQA | NQ | TriviaQA | MuSiQue | ELI5 |
|---|---|---|---|---|---|---|
| Metrics | EM | EM | EM | EM | EM | ROUGE-L |
| BM25 | 49.50 | 38.25 | 47.00 | 88.00 | 9.50 | 16.15 |
| BGE | 68.00 | 41.75 | 59.50 | 93.00 | 12.50 | 16.20 |
| E5-Mistral | 63.00 | 36.75 | 59.50 | 90.75 | 11.00 | 16.17 |
| LongLLMLingua | 62.50 | 45.00 | 56.75 | 92.50 | 10.25 | 15.84 |
| JinaAI Reader | 55.25 | 34.25 | 48.25 | 90.00 | 9.25 | 16.06 |
| HtmlRAG-Phi-3.8B | 68.50 | 46.25 | 60.50 | 93.50 | 13.25 | 16.33 |
| HtmlRAG-Llama-1B | 66.50 | 45.00 | 60.75 | 93.00 | 10.00 | 16.25 |
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citation
@misc{tan2024htmlraghtmlbetterplain,
      title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems},
      author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},
      year={2024},
      eprint={2411.02959},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2411.02959},
}