Propositionizer-Wiki-Flan-T5-Large: Open-Source Proposition Segmentation Model - Split Text into Standalone Propositions for Free

Propositionizer Wiki Flan T5 Large

Developed by chentong00

A proposition segmentation model based on Flan-T5-Large that decomposes text into standalone propositions.

Large Language Model

Transformers

License: Apache-2.0 #text-proposition-segmentation #knowledge-intensive-retrieval #structured-JSON-output

Downloads: 892

Released: 11/11/2023

Model Overview

The model breaks complex text passages into short, standalone propositions that are easier to retrieve and analyze.

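Model Features

Text proposition segmentation

Decomposes complex text into standalone propositions for downstream processing and analysis.

Structured output

Outputs the propositions as a JSON list, which is easy to process programmatically.

Multi-level input support

Accepts a title, a section, and content as input; the extra context is intended to improve segmentation accuracy.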

Model Capabilities

Text segmentation

Information extraction

Structured output

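Use Cases

Information Retrieval

Wikipedia content analysis

Decompose Wikipedia articles into standalone propositions to build a finer-grained retrieval system.

Intended benefit: higher precision and recall for the retrieval system.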
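To make the idea of proposition-level retrieval concrete, here is a minimal sketch that indexes propositions with a pointer back to their source passage and ranks them against a query. It is illustrative only: the Jaccard token-overlap score is a toy stand-in for a real retriever, and the names `tokenize`, `build_index`, `search`, and the passage id `pisa#0` are hypothetical, not part of the model or its paper.

```python
import re
from typing import Dict, List, Set, Tuple

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(passages: Dict[str, List[str]]) -> List[Tuple[str, str, Set[str]]]:
    """Flatten {passage_id: [propositions]} into (passage_id, proposition, tokens) rows."""
    return [
        (pid, prop, tokenize(prop))
        for pid, props in passages.items()
        for prop in props
    ]

def search(index, query: str, top_k: int = 3) -> List[Tuple[float, str, str]]:
    """Rank propositions by Jaccard overlap with the query (toy scoring, an assumption)."""
    q = tokenize(query)
    scored = []
    for pid, prop, toks in index:
        union = q | toks
        score = len(q & toks) / len(union) if union else 0.0
        scored.append((score, pid, prop))
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]

# Propositions as the model would return them for one passage.
passages = {
    "pisa#0": [
        "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.",
        "Leaning Tower of Pisa now leans at about 3.99 degrees.",
        "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center.",
    ]
}
index = build_index(passages)
results = search(index, "What angle does the Leaning Tower of Pisa lean at now?", top_k=1)
print(results[0][1], "->", results[0][2])
```

Keeping the passage id next to each proposition lets a retrieval hit be traced back to the full passage it came from. In practice the overlap score would be replaced by an embedding-based similarity.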

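Knowledge Graph Construction

Knowledge unit extraction

Extract standalone knowledge units from text for use in building a knowledge graph.

Intended benefit: more efficient, higher-quality knowledge graph construction.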

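🚀 Proposition Segmentation Model

This is the proposition segmentation model proposed by Chen et al. in the 2023 paper "Dense X Retrieval: What Retrieval Granularity Should We Use?". It decomposes input text into a list of propositions and outputs them in JSON format.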

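🚀 Quick Start

The model expects an input prompt in the format: Title: {title}. Section: {section}. Content: {content}. It outputs a JSON-formatted list of propositions.

For example, suppose the model is used to decompose the following passage: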

Title: Leaning Tower of Pisa. Section: . Content: Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees. This means the top of the tower is displaced horizontally 3.9 meters (12 ft 10 in) from the center.

The output will be:

["Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.", "Leaning Tower of Pisa now leans at about 3.99 degrees.", "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center."]
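The prompt format above can be assembled with a small helper, sketched below. The function name `build_prompt` is hypothetical; only the `Title: ... Section: ... Content: ...` layout comes from the model's documented input format. An empty section is passed as an empty string, which yields `Section: .` as in the example.

```python
def build_prompt(title: str, section: str, content: str) -> str:
    """Assemble the model input as 'Title: {title}. Section: {section}. Content: {content}'."""
    return f"Title: {title}. Section: {section}. Content: {content}"

# Example with an empty section, mirroring the passage above in shortened form.
prompt = build_prompt(
    "Leaning Tower of Pisa",
    "",
    "The tower now leans at about 3.99 degrees.",
)
print(prompt)
```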

💻 Usage Examples

Basic usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import json

# Load the tokenizer and model, using a GPU when one is available.
model_name = "chentong00/propositionizer-wiki-flan-t5-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

title = "Leaning Tower of Pisa"
section = ""
content = "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees. This means the top of the tower is displaced horizontally 3.9 meters (12 ft 10 in) from the center."

# Build the prompt in the format the model expects.
input_text = f"Title: {title}. Section: {section}. Content: {content}"

# Tokenize, generate, and move the result back to the CPU for decoding.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device), max_new_tokens=512).cpu()

# Decode the generated tokens and parse them as a JSON list of propositions.
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
try:
    prop_list = json.loads(output_text)
except json.JSONDecodeError:
    # Catch only JSON parse failures so that unrelated errors are not hidden.
    prop_list = []
    print("[ERROR] Failed to parse output text as JSON.")
print(json.dumps(prop_list, indent=2))

Expected output:

[
  "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.",
  "Leaning Tower of Pisa now leans at about 3.99 degrees.",
  "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center."
]
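Because the model generates free text, its output is not guaranteed to be valid JSON or a list of strings. The sketch below shows one way to validate it before downstream use. The helper name `parse_propositions` and the choice to drop non-string or blank entries are assumptions for illustration, not behavior defined by the model.

```python
import json
from typing import List

def parse_propositions(output_text: str) -> List[str]:
    """Parse decoded model output into a clean list of proposition strings.

    Returns an empty list when the text is not valid JSON or not a JSON list.
    Non-string and blank entries are skipped (an illustrative policy choice).
    """
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [p.strip() for p in parsed if isinstance(p, str) and p.strip()]

# Well-formed output parses into a list; malformed output yields an empty list.
good = parse_propositions('["Leaning Tower of Pisa now leans at about 3.99 degrees."]')
bad = parse_propositions('["Leaning Tower of Pisa now leans')
print(good)
print(bad)
```

Returning an empty list keeps the failure mode the same as in the basic example, so callers can treat "no propositions" and "unparseable output" uniformly, or log the raw text for inspection.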

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you use this model in your research, please cite the following paper:

@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Tong Chen and Hongwei Wang and Sihao Chen and Wenhao Yu and Kaixin Ma and Xinran Zhao and Hongming Zhang and Dong Yu},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023},
  URL = {https://arxiv.org/pdf/2312.06648.pdf}
}