Open-source rebel-large model: end-to-end relation extraction with support for more than 200 relation types


Rebel Large

Developed by Babelscape

REBEL is a BART-based sequence-to-sequence model for end-to-end relation extraction that supports more than 200 different relation types.

Knowledge Graph

Transformers

English #End-to-end relation extraction #Sequence-to-sequence model #Support for multiple relation types

Downloads: 37.57k

Release date: 3/2/2022

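Model Overview

REBEL simplifies the extraction of relation triplets from raw text by reframing relation extraction as a sequence-to-sequence task. An autoregressive sequence-to-sequence model generates the triplets directly from the text, which supports applications such as knowledge base population and fact verification.

Model Features

End-to-end relation extraction

Recasts relation extraction as a sequence-to-sequence task and generates relation triplets directly from text.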
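
The linearization can be illustrated in the other direction, by turning a list of triplets into the kind of string the model is trained to generate. The sketch below is a minimal illustration, not the authors' preprocessing code: the token layout (`<triplet>` head `<subj>` tail `<obj>` relation) is inferred from the parsing function in the usage examples of this card, `linearize_triplets` is a hypothetical helper name, and the relation labels in the sample are illustrative rather than actual model output.

```python
def linearize_triplets(triplets):
    """Turn (head, relation, tail) tuples into a REBEL-style target string.

    Assumed layout: <triplet> head <subj> tail <obj> relation, where further
    "<subj> tail <obj> relation" pairs may follow for the same head.
    """
    # Group by head in first-seen order so each head is written only once.
    by_head = {}
    for head, relation, tail in triplets:
        by_head.setdefault(head, []).append((tail, relation))

    parts = []
    for head, pairs in by_head.items():
        parts.append(f"<triplet> {head}")
        for tail, relation in pairs:
            parts.append(f"<subj> {tail} <obj> {relation}")
    return " ".join(parts)


# Hand-written sample triplets (illustrative, not model output).
example = [
    ("Punta Cana", "located in the administrative territorial entity", "La Altagracia Province"),
    ("Punta Cana", "country", "Dominican Republic"),
]
linearized = linearize_triplets(example)
print(linearized)
```

Because both sample triplets share the head "Punta Cana", the head appears once and is followed by two tail/relation pairs.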

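Support for multiple relation types

Supports more than 200 different relation types, which makes it suitable for a wide range of information extraction scenarios.

High performance

According to the paper, achieves state-of-the-art performance on most of the relation extraction and relation classification benchmarks it was fine-tuned on.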

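Model Capabilities

Relation extraction

Entity relation identification

Knowledge base population

Use Cases

Knowledge base construction

Knowledge base population

Extracts relation triplets from unstructured text and uses them to populate or validate a knowledge base.

Improves the coverage and accuracy of the knowledge base.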
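
One simple way to use extracted triplets for knowledge base population is to store them in a dictionary keyed by head entity and skip facts that are already present. The following is a minimal in-memory sketch under that assumption; `add_to_knowledge_base` and the sample triplets are hypothetical, and the dictionary keys (`head`, `type`, `tail`) follow the output format of the parsing function shown in the usage examples.

```python
from collections import defaultdict


def add_to_knowledge_base(kb, triplets):
    """Insert triplets into kb (head -> set of (type, tail)); return how many were new."""
    added = 0
    for triplet in triplets:
        fact = (triplet["type"], triplet["tail"])
        if fact not in kb[triplet["head"]]:
            kb[triplet["head"]].add(fact)
            added += 1
    return added


kb = defaultdict(set)
batch = [
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},  # duplicate
    {"head": "Higuey", "type": "country", "tail": "Dominican Republic"},
]
new_facts = add_to_knowledge_base(kb, batch)
print(f"{new_facts} new facts, {len(kb)} head entities")
```

A production knowledge base would also need entity linking and relation normalization, which this sketch leaves out.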

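Information extraction

Fact verification

Extracts relation triplets from text and uses them to check the accuracy of facts.

Supports automated fact verification workflows.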
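
As a rough illustration of the verification step, a claim expressed as a triplet can be compared with the triplets extracted from a trusted text. The sketch assumes exact matching after lowercasing and whitespace normalization, which is far cruder than a real fact verification system; `supports_claim`, `normalize`, and the sample data are hypothetical.

```python
FIELDS = ("head", "type", "tail")


def normalize(value):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def supports_claim(claim, extracted_triplets):
    """Return True if the claimed triplet matches one of the extracted triplets."""
    key = tuple(normalize(claim[field]) for field in FIELDS)
    return any(
        tuple(normalize(triplet[field]) for field in FIELDS) == key
        for triplet in extracted_triplets
    )


# Hand-written triplets standing in for the parser output.
extracted = [{"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"}]

claim_ok = {"head": "punta  cana", "type": "Country", "tail": "Dominican Republic"}
claim_bad = {"head": "Punta Cana", "type": "country", "tail": "Haiti"}
print(supports_claim(claim_ok, extracted), supports_claim(claim_bad, extracted))
```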

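🚀 REBEL: Relation Extraction By End-to-end Language generation

This model was presented in the Findings of EMNLP 2021 paper "REBEL: Relation Extraction By End-to-end Language generation". The paper proposes a new linearization approach and reframes relation extraction as a seq2seq task. It is available at the URL given in the citation below. If you use the code, please cite this work as follows.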

@inproceedings{huguet-cabot-navigli-2021-rebel-relation,
    title = "{REBEL}: Relation Extraction By End-to-end Language generation",
    author = "Huguet Cabot, Pere-Llu{\'\i}s  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.204",
    pages = "2370--2381",
    abstract = "Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model{'}s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.",
}

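The original repository for the paper can be found here.

The inference widget on the right does not output the special tokens that are needed to distinguish the subject, the object, and the relation type. For a demo of REBEL and its pre-training dataset, see the Spaces demo.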
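
Since the triplet structure lives entirely in those special tokens, it can be worth checking that a decoded string still contains them before trying to parse it. The guard below is a small sketch; the marker strings are the ones used by the parsing function in the usage examples, while `has_rebel_markers` and both sample strings are hypothetical.

```python
REBEL_MARKERS = ("<triplet>", "<subj>", "<obj>")


def has_rebel_markers(decoded_text):
    """Return True if all three structural markers survive in the decoded text."""
    return all(marker in decoded_text for marker in REBEL_MARKERS)


# Hand-written samples: the second mimics text decoded with special tokens removed.
with_markers = "<s><triplet> Punta Cana <subj> Dominican Republic <obj> country</s>"
without_markers = " Punta Cana  Dominican Republic  country"

print(has_rebel_markers(with_markers))
print(has_rebel_markers(without_markers))
```

Without the markers there is no reliable way to tell where the head ends and the tail or relation begins.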

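🚀 Quick Start

This section walks through the basic steps for relation extraction with the REBEL model.

✨ Key Features

Performs end-to-end relation extraction for more than 200 different relation types.
Achieves state-of-the-art performance on most of the relation extraction and relation classification benchmarks it was fine-tuned on.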

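📦 Installation

This model requires the transformers library, which can be installed with the following command. The usage examples also need a PyTorch installation, since the advanced example requests 'pt' tensors.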

pip install transformers
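
After installing, a quick way to confirm the environment is to query the installed package versions. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and checking `torch` reflects the assumption that PyTorch is the backend, as in the advanced example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("transformers", "torch"):
    print(name, installed_version(name) or "not installed")
```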

💻 Usage Examples

Basic usage

from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# The special tokens are needed, so decode the generated token ids with the tokenizer manually.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor("Punta Cana is a resort town in the municipality of Higuey, in La Altagracia Province, the eastern most province of the Dominican Republic", return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])
# Function that parses the generated text and extracts the triplets
def extract_triplets(text):
    triplets = []
    # Field buffers: subject holds the head, object_ the tail, relation the relation type.
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
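
To see the parsing logic at work without downloading the model, it can help to run a parser on a hand-written string. The block below is a compact, split-based alternative to the token-by-token parser above, written only for illustration; `parse_linearized` is a hypothetical name, the sample string is hand-written rather than real model output, and its behavior on malformed output may differ from `extract_triplets`.

```python
import re


def parse_linearized(text):
    """Parse '<triplet> head <subj> tail <obj> relation ...' into triplet dicts."""
    triplets = []
    # Drop sequence-level special tokens, as the parser above does.
    text = re.sub(r"<s>|</s>|<pad>", "", text)
    # Each chunk after "<triplet>" is a head followed by "<subj> tail <obj> relation" pairs.
    for chunk in text.split("<triplet>")[1:]:
        head, *pairs = chunk.split("<subj>")
        for pair in pairs:
            if "<obj>" not in pair:
                continue  # incomplete pair, nothing to extract
            tail, relation = pair.split("<obj>", 1)
            if head.strip() and tail.strip() and relation.strip():
                triplets.append(
                    {"head": head.strip(), "type": relation.strip(), "tail": tail.strip()}
                )
    return triplets


# Hand-written sample in the assumed output layout (not real model output).
sample = (
    "<s><triplet> Punta Cana <subj> La Altagracia Province "
    "<obj> located in the administrative territorial entity "
    "<subj> Dominican Republic <obj> country</s>"
)
parsed = parse_linearized(sample)
for triplet in parsed:
    print(triplet)
```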

Advanced usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def extract_triplets(text):
    triplets = []
    # Field buffers: subject holds the head, object_ the tail, relation the relation type.
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
# Beam search settings; num_return_sequences must not exceed num_beams.
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Text to extract triplets from
text = 'Punta Cana is a resort town in the municipality of Higüey, in La Altagracia Province, the easternmost province of the Dominican Republic.'

# Tokenize the text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors='pt')

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(model.device),
    attention_mask=model_inputs["attention_mask"].to(model.device),
    **gen_kwargs,
)

# Decode the generated tokens, keeping the special tokens
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

# Extract the triplets from each returned sequence
for idx, sentence in enumerate(decoded_preds):
    print(f'Prediction triplets sentence {idx}')
    print(extract_triplets(sentence))
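
With `num_return_sequences` set to 3, the loop above prints three triplet lists that usually overlap. One possible way to combine them is to count how many of the returned sequences contain each triplet. This is a sketch of that idea, not something the model card prescribes; `merge_beam_triplets`, the `votes` field, and the sample lists are hypothetical.

```python
from collections import Counter


def merge_beam_triplets(triplets_per_sequence):
    """Merge triplet lists from several sequences, counting sequences per triplet."""
    counts = Counter()
    for triplets in triplets_per_sequence:
        # Count each distinct triplet at most once per sequence.
        unique = {(t["head"], t["type"], t["tail"]) for t in triplets}
        counts.update(unique)
    return [
        {"head": head, "type": rel, "tail": tail, "votes": votes}
        for (head, rel, tail), votes in counts.most_common()
    ]


# Hand-written stand-ins for the three parsed beam outputs.
a = {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"}
b = {"head": "Higuey", "type": "country", "tail": "Dominican Republic"}
c = {"head": "La Altagracia Province", "type": "country", "tail": "Dominican Republic"}
per_sequence = [[a, b], [a], [a, c, a]]

merged = merge_beam_triplets(per_sequence)
for triplet in merged:
    print(triplet)
```

Triplets that appear in every returned sequence come first; a threshold on `votes` could then be used to keep only the more stable predictions.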