Open-source rebel-large model - end-to-end relation extraction with support for more than 200 relation types

Rebel Large

Developed by Babelscape

REBEL is a BART-based sequence-to-sequence model for end-to-end relation extraction that supports more than 200 relation types.

Knowledge Graph

Transformers

English #End-to-end relation extraction #Sequence-to-sequence model #Multiple relation types

Downloads: 37.57k

Released: 3/2/2022

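Model overview

REBEL simplifies the extraction of relation triplets from raw text by reframing relation extraction as a sequence-to-sequence task. It uses an autoregressive seq2seq model to generate relation triplets directly from the input text, which supports applications such as knowledge base population and fact-checking.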

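Model features

End-to-end relation extraction

Reduces relation extraction to a sequence-to-sequence task and generates relation triplets directly from text.

Multiple relation types

Supports more than 200 relation types, which makes it suitable for a wide range of information extraction scenarios.

High performance

Reaches state-of-the-art performance on most of the relation extraction and relation classification benchmarks it was fine-tuned on, as reported in the paper.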

Model capabilities

Relation extraction

Entity-relation recognition

Knowledge base population

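Use cases

Knowledge base construction

Knowledge base population

Extract relation triplets from unstructured text to populate or validate a knowledge base.

Improves the coverage and accuracy of the knowledge base.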
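
As a rough illustration of the population step, the sketch below stores triplets in a nested dictionary and reports which ones were new. The `populate_kb` helper, the in-memory layout, and the sample triplets are assumptions made for this example (the triplets are hand-written, not model output); only the head/type/tail dictionary shape comes from the usage code on this page.

```python
from collections import defaultdict


def populate_kb(kb, triplets):
    """Add triplets to kb (head -> relation -> set of tails).

    Returns the triplets that were not already present.
    Hypothetical helper, not part of REBEL or transformers.
    """
    added = []
    for t in triplets:
        tails = kb[t["head"]][t["type"]]
        if t["tail"] not in tails:
            tails.add(t["tail"])
            added.append(t)
    return added


# Toy knowledge base: head entity -> relation type -> set of tail entities.
kb = defaultdict(lambda: defaultdict(set))

# Hand-written sample triplets in the head/type/tail shape used on this page.
extracted = [
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
    {"head": "Higüey", "type": "country", "tail": "Dominican Republic"},
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},  # duplicate
]

new_facts = populate_kb(kb, extracted)
print(len(new_facts))  # the duplicate is skipped
print(dict(kb["Punta Cana"]))
```

A real knowledge base would also need entity linking and relation normalization before insertion; this sketch skips both.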

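Information extraction

Fact-checking

Extract relation triplets from text to verify the accuracy of factual claims.

Supports automated fact-checking workflows.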

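🚀 REBEL: Relation Extraction By End-to-end Language generation

REBEL proposes a new linearization approach and reframes relation extraction as a sequence-to-sequence (seq2seq) task. The model can be used to extract relation triplets from raw text, which is useful for downstream tasks such as knowledge graph population and fact-checking.

Multilingual update! Check out mREBEL, a multilingual version that covers more relation types and languages and also includes entity types.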

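✨ Key features

New linearization approach: relation triplets are expressed as a sequence of text, which simplifies the relation extraction task.
End-to-end relation extraction: a BART-based seq2seq model performs end-to-end relation extraction for more than 200 relation types.
Flexible: the model has been fine-tuned on an array of relation extraction and relation classification benchmarks and reaches state-of-the-art performance on most of them.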
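
To make the linearization idea concrete, here is a minimal sketch that turns triplets into a marker-delimited string. The marker order (`<triplet>` head, `<subj>` tail, `<obj>` relation) is inferred from the parsing function in the usage examples on this page; grouping triplets that share a head under a single `<triplet>` marker is an assumption about the training format, and the `linearize` helper and sample triplets are hypothetical.

```python
def linearize(triplets):
    """Serialize head/type/tail dicts as '<triplet> head <subj> tail <obj> relation ...'.

    Triplets sharing a head are grouped under one <triplet> marker,
    in the order in which heads first appear in the input.
    """
    grouped = {}
    for t in triplets:
        grouped.setdefault(t["head"], []).append((t["tail"], t["type"]))

    parts = []
    for head, pairs in grouped.items():
        parts.append(f"<triplet> {head}")
        for tail, relation in pairs:
            parts.append(f"<subj> {tail} <obj> {relation}")
    return " ".join(parts)


# Hand-written sample triplets (not model output).
example = [
    {"head": "Punta Cana", "type": "located in the administrative territorial entity",
     "tail": "La Altagracia Province"},
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
    {"head": "La Altagracia Province", "type": "country", "tail": "Dominican Republic"},
]

linearized = linearize(example)
print(linearized)
```

A string in this shape is what the decoder is trained to generate, which is why the markers must survive decoding for the triplets to be recoverable.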

📚 Documentation

This is the model card for the Findings of EMNLP 2021 paper REBEL: Relation Extraction By End-to-end Language generation. If you use the code, please cite this work in your paper:

@inproceedings{huguet-cabot-navigli-2021-rebel-relation,
    title = "{REBEL}: Relation Extraction By End-to-end Language generation",
    author = "Huguet Cabot, Pere-Llu{\'\i}s  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.204",
    pages = "2370--2381",
    abstract = "Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model{'}s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.",
}

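The original repository for the paper can be found here.

Note that the inference widget on the right does not output special tokens, which are needed to tell the subject, object, and relation type apart. For a demo of REBEL and its pre-training dataset, see the Spaces demo.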

💻 Usage examples

Basic usage

from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor("Punta Cana is a resort town in the municipality of Higuey, in La Altagracia Province, the eastern most province of the Dominican Republic", return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)

Advanced usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Text to extract triplets from
text = 'Punta Cana is a resort town in the municipality of Higüey, in La Altagracia Province, the easternmost province of the Dominican Republic.'

# Tokenize text
model_inputs = tokenizer(text, max_length=256, padding=True, truncation=True, return_tensors = 'pt')

# Generate
generated_tokens = model.generate(
    model_inputs["input_ids"].to(model.device),
    attention_mask=model_inputs["attention_mask"].to(model.device),
    **gen_kwargs,
)

# Extract text
decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

# Extract triplets
for idx, sentence in enumerate(decoded_preds):
    print(f'Prediction triplets sentence {idx}')
    print(extract_triplets(sentence))
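
Because `num_return_sequences` is set to 3, the loop above prints three triplet lists that often overlap. One possible way to combine them, sketched here under assumptions, is to deduplicate the triplets and count how many returned sequences contain each one. The `merge_beam_triplets` helper and the `votes` field are hypothetical, and the sample lists are hand-written rather than model output.

```python
def merge_beam_triplets(per_sequence_triplets):
    """Deduplicate triplets across returned sequences.

    Each triplet is counted once per sequence; the result is sorted by
    how many sequences contain it (descending), then alphabetically.
    """
    counts = {}
    for triplets in per_sequence_triplets:
        unique = {(t["head"], t["type"], t["tail"]) for t in triplets}
        for key in unique:
            counts[key] = counts.get(key, 0) + 1

    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"head": head, "type": relation, "tail": tail, "votes": votes}
        for (head, relation, tail), votes in ranked
    ]


# Hand-written stand-ins for the output of extract_triplets on three sequences.
a = {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"}
b = {"head": "Punta Cana", "type": "located in the administrative territorial entity",
     "tail": "La Altagracia Province"}
c = {"head": "Higüey", "type": "country", "tail": "Dominican Republic"}

merged = merge_beam_triplets([[a, b], [a], [a, c]])
for triplet in merged:
    print(triplet)
```

With the real model, the input would be `[extract_triplets(s) for s in decoded_preds]`. The count is only a rough agreement signal, not a calibrated confidence score.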

📄 License

This project is released under the cc-by-nc-sa-4.0 license.

📦 Model information

Property	Details
Model type	seq2seq
Training data	Babelscape/rebel-dataset
Task	Relation extraction
Evaluation datasets	CoNLL04, NYT
CoNLL04 metric	RE+ Macro F1: 76.65
NYT metric	F1: 93.4
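
For readers unfamiliar with triplet-level scoring, the following sketch computes precision, recall, and F1 by exact match over sets of triplets. It is not the evaluation script behind the figures above: the `triplet_f1` helper, the exact-match criterion, and the sample data are assumptions for illustration, and it does not perform the per-relation-type averaging that a macro F1 involves.

```python
def triplet_f1(predicted, gold):
    """Exact-match precision, recall and F1 over sets of (head, type, tail)."""
    pred = {(t["head"], t["type"], t["tail"]) for t in predicted}
    ref = {(t["head"], t["type"], t["tail"]) for t in gold}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hand-written sample data: one of the two predictions matches the gold set.
gold = [
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
    {"head": "Higüey", "type": "country", "tail": "Dominican Republic"},
]
predicted = [
    {"head": "Punta Cana", "type": "country", "tail": "Dominican Republic"},
    {"head": "Punta Cana", "type": "country", "tail": "Higüey"},
]

precision, recall, f1 = triplet_f1(predicted, gold)
print(precision, recall, f1)
```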