Propositionizer Wiki Flan T5 Large: a free, open-source proposition segmentation model that splits text into independent proposition units

Propositionizer Wiki Flan T5 Large

Developed by chentong00

A proposition segmentation model based on Flan-T5-Large that decomposes text into independent proposition units.

Large language model

Transformers

Open-source license: Apache-2.0 #text-proposition-segmentation #knowledge-intensive-retrieval #JSON-structured-output

Downloads: 892

Release date: 11/11/2023

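Model overview

This model is mainly used to break complex text passages into short, independent proposition units, which makes information retrieval and analysis easier.

Model features

Text proposition segmentation

Decomposes complex text into independent proposition units that are easier to process and analyze downstream.

Structured output

Outputs the propositions as a JSON list, which is easy to handle programmatically.

Multi-level input support

Accepts the title, section, and content as separate input levels, which improves segmentation accuracy.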

Model capabilities

Text segmentation

Information extraction

Structured output

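Use cases

Information retrieval

Wikipedia content analysis

Decomposes Wikipedia articles into independent propositions, which makes it easier to build retrieval systems that work at a finer granularity.

Aims to improve the precision and recall of the retrieval system.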
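To illustrate what proposition-level retrieval means in practice, the sketch below ranks propositions against a query using a toy bag-of-words cosine score. This is only an assumption-laden stand-in: a real system would typically use a dense retriever, and the helper names `tokenize`, `cosine`, and `rank` are hypothetical. The propositions are the ones from the usage example on this page.

```python
import re
from collections import Counter
from math import sqrt
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word and number tokens."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query: str, propositions: List[str]) -> List[Tuple[float, str]]:
    """Return (score, proposition) pairs sorted from best to worst match."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in propositions]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Each proposition is indexed as its own retrieval unit.
propositions = [
    "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.",
    "Leaning Tower of Pisa now leans at about 3.99 degrees.",
    "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center.",
]

query = "How far is the top of the tower displaced?"
ranked = rank(query, propositions)
print(ranked[0][1])
```

Because each proposition stands on its own, the top-ranked unit can be returned directly instead of a longer passage that mixes several facts.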

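Knowledge graph construction

Knowledge unit extraction

Extracts independent knowledge units from text for use in building knowledge graphs.

Aims to improve the efficiency and quality of knowledge graph construction.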

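🚀 Proposition Segmentation Model

This is the proposition segmentation model from the paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" by Chen et al. (2023).

🚀 Quick Start

This proposition segmentation model splits an input passage into propositions. The input prompt has the form Title: {title}. Section: {section}. Content: {content}, and the output is a list of propositions in JSON format.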
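The prompt format can be wrapped in a small helper so that every passage is formatted the same way. This is a minimal sketch; `build_prompt` is a hypothetical name and not part of the model or of the Transformers library.

```python
def build_prompt(title: str, section: str, content: str) -> str:
    """Format one passage as 'Title: ... Section: ... Content: ...'."""
    return f"Title: {title}. Section: {section}. Content: {content}"


# The section may be left empty, as in the usage example on this page.
prompt = build_prompt(
    "Leaning Tower of Pisa",
    "",
    "The tower now leans at about 3.99 degrees.",
)
print(prompt)
```

The resulting string is what gets passed to the tokenizer.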

💻 Usage Examples

Basic usage

import json

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model, using a GPU when one is available.
model_name = "chentong00/propositionizer-wiki-flan-t5-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

title = "Leaning Tower of Pisa"
section = ""
content = "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees. This means the top of the tower is displaced horizontally 3.9 meters (12 ft 10 in) from the center."

# Build the prompt in the format the model expects.
input_text = f"Title: {title}. Section: {section}. Content: {content}"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device), max_new_tokens=512).cpu()

# The decoded text should be a JSON list of proposition strings.
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
try:
    prop_list = json.loads(output_text)
except json.JSONDecodeError:
    # Catch only JSON parsing failures rather than every exception.
    prop_list = []
    print("[ERROR] Failed to parse output text as JSON.")
print(json.dumps(prop_list, indent=2))

Expected output

[
  "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.",
  "Leaning Tower of Pisa now leans at about 3.99 degrees.",
  "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center."
]
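Generated text is not guaranteed to be valid JSON, or to have the list-of-strings shape shown above, so it can help to validate it before use. The sketch below is one possible approach; `parse_propositions` is a hypothetical helper, and dropping non-string or empty items is an assumption about what a caller would want.

```python
import json
from typing import List


def parse_propositions(output_text: str) -> List[str]:
    """Parse model output into a list of non-empty proposition strings.

    Returns an empty list when the text is not valid JSON or is not a
    JSON list. Non-string and blank items are dropped.
    """
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [p.strip() for p in parsed if isinstance(p, str) and p.strip()]


print(parse_propositions('["Leaning Tower of Pisa now leans at about 3.99 degrees."]'))
print(parse_propositions('["truncated output'))
```

Returning an empty list on failure matches the fallback in the usage example, so downstream code can treat "no propositions" and "unparseable output" the same way.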

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Tong Chen and Hongwei Wang and Sihao Chen and Wenhao Yu and Kaixin Ma and Xinran Zhao and Hongming Zhang and Dong Yu},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023},
  URL = {https://arxiv.org/pdf/2312.06648.pdf}
}