HTML Pruner Phi 3.8B
An HTML pruning model designed for RAG systems where HTML is more suitable than plain text for modeling retrieval results
Downloads: 319
Release date: 10/16/2024
Model Overview
This model specializes in processing HTML-formatted retrieval results, optimizing knowledge retrieval efficiency in RAG systems through lossless HTML cleaning and two-step HTML pruning based on block trees.
Model Features
Lossless HTML Cleaning
Only removes completely irrelevant content and compresses redundant structures while preserving all semantic information in the original HTML.
Two-Step HTML Pruning Based on Block Trees
In the first step, an embedding model scores the blocks; in the second step, a generative model produces block paths to complete the pruning efficiently.
HTML Format Optimization
Specifically optimizes HTML-formatted retrieval results for RAG systems to improve knowledge retrieval efficiency.
Model Capabilities
HTML Document Cleaning
HTML Content Pruning
Semantic Information Preservation
RAG System Optimization
Use Cases
Information Retrieval
Web Content Simplification
Extracts key information from complex HTML web pages while removing redundant content, yielding more concise HTML that preserves the original semantics.
RAG System Knowledge Formatting
Prepares HTML-formatted external knowledge sources for RAG systems, improving their retrieval efficiency and accuracy.
HTML Pruner Model
This project releases the HTML pruner model used in HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems. The paper proposes HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To handle the long context brought by HTML, it also proposes lossless HTML cleaning and a two-step, block-tree-based HTML pruning method.
✨ Features
- Lossless HTML Cleaning: Removes completely irrelevant content and compresses redundant structures while retaining all semantic information in the original HTML. Suitable for RAG systems with long-context LLMs that do not want to lose any information before generation.
- Two-Step Block-Tree-Based HTML Pruning: Consists of two steps conducted on the block tree structure. The first step uses an embedding model to score blocks, and the second step uses a generative model that produces block paths to finish the pruning (see the end-to-end sketch below).
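The two steps map directly onto the htmlrag API demonstrated in the usage examples that follow. The condensed sketch below strings them together end to end; it assumes the recommended real-world settings from the configuration section (max_node_words of 256/128, context windows of 6144/4096 tokens), the file path is a placeholder, and whether every keyword argument shown is strictly required is an assumption rather than documented behavior.
# Condensed end-to-end sketch of the HtmlRAG pruning pipeline (settings are illustrative)
from htmlrag import clean_html, build_block_tree, EmbedHTMLPruner, GenHTMLPruner
from transformers import AutoTokenizer
question = "When was the bellagio in las vegas built?"
html = open("page.html").read()  # raw retrieved HTML (placeholder path)
simplified_html = clean_html(html)  # lossless cleaning
# step 1: coarse pruning with an embedding model
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=256)
query_instruction = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_pruner = EmbedHTMLPruner(embed_model="BAAI/bge-large-en", local_inference=True, query_instruction_for_retrieval=query_instruction)
rankings = embed_pruner.calculate_block_rankings(question, simplified_html, block_tree)
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
pruned_html = embed_pruner.prune_HTML(simplified_html, block_tree, rankings, chat_tokenizer, 6144)
# step 2: fine-grained pruning with the generative pruner model
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=128)
gen_pruner = GenHTMLPruner(gen_model="zstanjj/HTML-Pruner-Phi-3.8B", device="cuda")  # or device="cpu"
rankings = gen_pruner.calculate_block_rankings(question, pruned_html, block_tree)
pruned_html = gen_pruner.prune_HTML(pruned_html, block_tree, rankings, chat_tokenizer, 4096)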
📦 Installation
Install the package using pip:
pip install htmlrag
Or install the package from source:
pip install -e .
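To check that the installation worked, a quick smoke test that imports the entry points used in the examples below is enough (this is just a suggestion, not part of the official instructions; the expected output of clean_html here is not asserted):
# optional smoke test for the htmlrag package
from htmlrag import clean_html, build_block_tree, EmbedHTMLPruner, GenHTMLPruner
print(clean_html("<html><body><p>hello</p></body></html>"))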
💻 Usage Examples
Basic Usage
HTML Cleaning
from htmlrag import clean_html
question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<h1>Bellagio Hotel in Las</h1>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""
# alternatively, you can read HTML files and merge them
# html_files=["/path/to/html/file1.html", "/path/to/html/file2.html"]
# htmls=[open(file).read() for file in html_files]
# html = "\n".join(htmls)
simplified_html = clean_html(html)
print(simplified_html)
# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
Configure Pruning Parameters
# Maximum number of words in a node when constructing the block tree for pruning with the embedding model
MAX_NODE_WORDS_EMBED = 10
# MAX_NODE_WORDS_EMBED = 256 # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the embedding model
MAX_CONTEXT_WINDOW_EMBED = 60
# MAX_CONTEXT_WINDOW_EMBED = 6144 # a recommended setting for real-world HTML documents
# Maximum number of words in a node when constructing the block tree for pruning with the generative model
MAX_NODE_WORDS_GEN = 5
# MAX_NODE_WORDS_GEN = 128 # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the generative model
MAX_CONTEXT_WINDOW_GEN = 32
# MAX_CONTEXT_WINDOW_GEN = 4096 # a recommended setting for real-world HTML documents
Build Block Tree
from htmlrag import build_block_tree
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
# block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content: <h1>Bellagio Hotel in Las</h1>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path: ['html', 'div']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
Prune HTML Blocks with Embedding Model
from htmlrag import EmbedHTMLPruner
embed_model="BAAI/bge-large-en"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval = query_instruction_for_retrieval)
# alternatively you can init a remote TEI model, refer to https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval = query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings=embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [2, 0, 1]
# alternatively, you can use BM25 to rank the blocks
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings = bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [2, 0, 1]
from transformers import AutoTokenizer
# tokenizer of the downstream chat model, used to enforce the token budget of the pruned HTML
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_EMBED)
print(pruned_html)
# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
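If you want to verify that the embedding-stage output actually fits the configured budget, you can count tokens with the same chat tokenizer (a small sanity check, not part of the library API):
# check the pruned HTML against the embedding-stage token budget
n_tokens = len(chat_tokenizer.encode(pruned_html))
print(f"pruned HTML uses {n_tokens} tokens (budget: {MAX_CONTEXT_WINDOW_EMBED})")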
Prune HTML Blocks with Generative Model
from htmlrag import GenHTMLPruner
import torch
# construct a finer block tree
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
# block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True) # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content: <h1>Bellagio Hotel in Las</h1>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
ckpt_path = "zstanjj/HTML-Pruner-Phi-3.8B"
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, device=device)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html, block_tree)
print(block_rankings)
# [1, 0]
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_GEN)
print(pruned_html)
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
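The pruned HTML is intended to be passed to the chat model as external knowledge. A minimal sketch of that final step is shown below; the prompt wording and message layout are illustrative assumptions, not the template used in the paper.
# assemble a chat prompt that uses the pruned HTML as context (illustrative template)
messages = [
    {"role": "user",
     "content": f"Answer the question based on the following HTML.\n\n{pruned_html}\n\nQuestion: {question}"},
]
prompt = chat_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# the resulting prompt can then be sent to the chat model (e.g. Llama-3.1-70B-Instruct) for generation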
📚 Documentation
Model Information
We release the HTML pruner model used in HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems.
Useful links: the paper is available on arXiv at https://arxiv.org/abs/2411.02959.
If you use this model, please star our GitHub repository to support us. Your star means a lot!
Results
- Results for HTML-Pruner-Phi-3.8B and HTML-Pruner-Llama-1B with Llama-3.1-70B-Instruct as the chat model.
| Dataset | ASQA | HotpotQA | NQ | TriviaQA | MuSiQue | ELI5 |
|---|---|---|---|---|---|---|
| Metrics | EM | EM | EM | EM | EM | ROUGE-L |
| BM25 | 49.50 | 38.25 | 47.00 | 88.00 | 9.50 | 16.15 |
| BGE | 68.00 | 41.75 | 59.50 | 93.00 | 12.50 | 16.20 |
| E5-Mistral | 63.00 | 36.75 | 59.50 | 90.75 | 11.00 | 16.17 |
| LongLLMLingua | 62.50 | 45.00 | 56.75 | 92.50 | 10.25 | 15.84 |
| JinaAI Reader | 55.25 | 34.25 | 48.25 | 90.00 | 9.25 | 16.06 |
| HtmlRAG-Phi-3.8B | 68.50 | 46.25 | 60.50 | 93.50 | 13.25 | 16.33 |
| HtmlRAG-Llama-1B | 66.50 | 45.00 | 60.75 | 93.00 | 10.00 | 16.25 |
📄 License
This project is licensed under the Apache-2.0 license.
📖 Citation
@misc{tan2024htmlraghtmlbetterplain,
      title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems},
      author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},
      year={2024},
      eprint={2411.02959},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2411.02959},
}