# GENRE
The GENRE (Generative ENtity REtrieval) system, introduced in Autoregressive Entity Retrieval, is implemented in PyTorch. GENRE uses a sequence-to-sequence approach to entity retrieval (e.g. entity linking), built on a fine-tuned BART architecture. It performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers are generated. The model was first released in the facebookresearch/GENRE repository using fairseq; the transformers models were obtained through a conversion script similar to this.
This model was trained on the full training set of BLINK, which consists of 9M datapoints for entity disambiguation grounded on Wikipedia.
## Documentation

### BibTeX entry and citation info
Please consider citing our works if you use code from this repository.
```bibtex
@inproceedings{decao2020autoregressive,
  title={Autoregressive Entity Retrieval},
  author={Nicola {De Cao} and Gautier Izacard and Sebastian Riedel and Fabio Petroni},
  booktitle={International Conference on Learning Representations},
  url={https://openreview.net/forum?id=5k8F6UU39V},
  year={2021}
}
```
## Usage Examples

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/genre-linking-blink")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/genre-linking-blink").eval()

sentences = ["Einstein was a [START_ENT] German [END_ENT] physicist."]

outputs = model.generate(
    **tokenizer(sentences, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=5,
)

tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
This code outputs the following top-5 predictions (using constrained beam search):
```python
['Germans',
 'Germany',
 'German Empire',
 'Weimar Republic',
 'Greeks']
```
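Constrained beam search restricts each decoding step to token continuations that can still complete a valid entity name, typically via a trie passed to `generate()` through its `prefix_allowed_tokens_fn` argument. The sketch below illustrates the idea with a minimal trie over made-up token-ID sequences; the real token IDs would come from applying the model's tokenizer to the set of valid entity names, and the function names here (`Trie`, `prefix_allowed_tokens_fn`) are illustrative, not part of the released API.

```python
class Trie:
    """Minimal trie over token-ID sequences of valid entity names."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_tokens(self, prefix):
        """Return the token IDs that may follow `prefix`; [] if prefix is invalid."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return list(node.keys())


# Hypothetical token-ID sequences for two entity names (IDs are made up).
trie = Trie([[0, 11, 22, 2], [0, 11, 33, 2]])


# A callback with the signature transformers' generate() expects for
# prefix_allowed_tokens_fn(batch_id, input_ids); it would be passed as
# model.generate(..., prefix_allowed_tokens_fn=prefix_allowed_tokens_fn).
def prefix_allowed_tokens_fn(batch_id, input_ids):
    return trie.allowed_tokens(list(input_ids))
```

With this constraint in place, beams that wander off every valid entity name are pruned, which is how GENRE guarantees that generated strings are valid identifiers.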
## Technical Details

The GENRE system uses a sequence-to-sequence approach for entity retrieval, based on a fine-tuned BART architecture. The model was trained on the full training set of BLINK, which contains 9M datapoints for entity disambiguation grounded on Wikipedia. It uses constrained beam search to generate only valid entity identifiers conditioned on the input text.
### Information Table

| Property | Details |
|----------|---------|
| Model Type | Generative ENtity REtrieval (GENRE) based on a fine-tuned BART architecture |
| Training Data | Full training set of BLINK: 9M datapoints for entity disambiguation grounded on Wikipedia |