# KRISSBERT
KRISSBERT is a contextual encoder for entity linking. It tackles the core challenges of the task, prolific name variations and prevalent ambiguities, by leveraging Knowledge-RIch Self-Supervision (KRISS) over readily available unlabeled text and domain knowledge. KRISSBERT outperforms prior self-supervised methods on biomedical entity linking tasks.
## Quick Start
The following steps show how to use KRISSBERT for entity linking with the MedMentions dataset.
### Installation
1. Create a conda environment and install the requirements:

   ```shell
   conda create -n kriss -y python=3.8 && conda activate kriss
   pip install -r requirements.txt
   ```

2. Change into the `usage` directory:

   ```shell
   cd usage
   ```

3. Download the MedMentions dataset:

   ```shell
   git clone https://github.com/chanzuckerberg/MedMentions.git
   ```
### Usage Examples
1. Generate prototype embeddings:

   ```shell
   python generate_prototypes.py
   ```

2. Run entity linking:

   ```shell
   python run_entity_linking.py
   ```

This yields about 58.3% top-1 accuracy on MedMentions.
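Conceptually, the linking step ranks candidate entities by the similarity between a mention embedding and the prototype embeddings. The following is a minimal sketch of that idea only, not the project's actual code; the prototype vectors, mention vector, and CUI assignments are made-up toy values:

```python
import numpy as np

# Toy prototype embeddings: one row per entity prototype, with the
# UMLS CUI each prototype belongs to (illustrative values only).
prototypes = np.array([
    [0.9, 0.1, 0.0],   # prototype for CUI C0004057 (aspirin)
    [0.1, 0.9, 0.0],   # prototype for CUI C0011849 (diabetes mellitus)
    [0.0, 0.2, 0.9],   # prototype for CUI C0020538 (hypertension)
])
cuis = ["C0004057", "C0011849", "C0020538"]

def link_mention(mention_vec, prototypes, cuis):
    """Return the CUI whose prototype is most cosine-similar to the mention."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    m = mention_vec / np.linalg.norm(mention_vec)
    scores = p @ m                       # cosine similarity to every prototype
    return cuis[int(np.argmax(scores))]

# A mention embedding as produced by the encoder (toy value).
mention = np.array([0.85, 0.15, 0.05])
print(link_mention(mention, prototypes, cuis))  # -> C0004057
```

In the real pipeline, `generate_prototypes.py` builds the prototype matrix and `run_entity_linking.py` performs the nearest-neighbor search over it.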
## Features
- Knowledge-Rich Self-Supervision: KRISSBERT leverages readily available unlabeled text and domain knowledge for self-supervision, which helps in handling entity linking challenges such as prolific variations and prevalent ambiguities.
- Context-Aware: Unlike some prior systems, KRISSBERT takes into account the context of an entity mention, enabling it to disambiguate ambiguous mentions more effectively.
- State-of-the-Art Performance: Experiments on seven standard biomedical entity linking datasets show that KRISSBERT attains a new state of the art, outperforming prior self-supervised methods by as much as 20 absolute points in accuracy.
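Context awareness hinges on encoding the mention together with its surrounding sentence rather than the surface form alone. A minimal sketch of that preprocessing idea, using hypothetical `[Ms]`/`[Me]` marker tokens (the actual marker strings are defined by the project's code, not here):

```python
def mark_mention(text, start, end, start_tag="[Ms]", end_tag="[Me]"):
    """Insert marker tokens around the mention span so the encoder
    sees both the surface form and its surrounding context."""
    return text[:start] + start_tag + " " + text[start:end] + " " + end_tag + text[end:]

sentence = "The patient was given ms for pain relief."
# "ms" alone is ambiguous (morphine sulfate, multiple sclerosis, ...);
# the marked sentence gives the encoder the disambiguating context.
marked = mark_mention(sentence, 22, 24)
print(marked)
# -> The patient was given [Ms] ms [Me] for pain relief.
```

A context-free system that only matches "ms" against an entity dictionary has no way to choose among the candidate entities; the marked, contextual input does.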
## Documentation
Entity linking faces significant challenges such as prolific variations and prevalent ambiguities, especially in high-value domains with myriad entities. Standard classification approaches suffer from the annotation bottleneck and cannot effectively handle unseen entities. Zero-shot entity linking has emerged as a promising direction for generalizing to new entities, but it still requires example gold entity mentions during training and canonical descriptions for all entities, both of which are rarely available outside of Wikipedia (Logeswaran et al., 2019; Wu et al., 2020).
Specifically, the KRISSBERT model is initialized with PubMedBERT parameters, and then continuously pretrained using biomedical entity names from the UMLS ontology to self-supervise entity linking examples from PubMed abstracts.
Some prior systems like BioSyn, SapBERT, and their follow-up work (e.g., Lai et al., 2021) claimed to do entity linking, but they completely ignore the context of an entity mention, and can only predict a surface form in the entity dictionary, not the canonical entity ID (e.g., CUI in UMLS). Therefore, they cannot disambiguate ambiguous mentions.
## Technical Details
The KRISSBERT model is initialized with the parameters of PubMedBERT. Then, it is continuously pretrained using biomedical entity names from the UMLS ontology. The self-supervision process uses entity linking examples from PubMed abstracts.
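The self-supervision step can be pictured as mining mention examples by matching known UMLS entity names against abstract text. The sketch below is a simplified illustration of that idea; the dictionary entries and the naive string-matching strategy are assumptions for demonstration, not the project's actual pipeline:

```python
import re

# Toy slice of a UMLS-style name -> CUI dictionary (illustrative only).
umls_names = {
    "diabetes mellitus": "C0011849",
    "aspirin": "C0004057",
    "hypertension": "C0020538",
}

def mine_examples(abstract, name_to_cui):
    """Find dictionary names in the text and emit (start, end, surface, CUI)
    tuples that can serve as self-supervised entity-linking examples."""
    examples = []
    for name, cui in name_to_cui.items():
        for m in re.finditer(re.escape(name), abstract, flags=re.IGNORECASE):
            examples.append((m.start(), m.end(), m.group(0), cui))
    return sorted(examples)

abstract = "Aspirin is often prescribed alongside treatment for hypertension."
for example in mine_examples(abstract, umls_names):
    print(example)
```

Each mined mention, together with its surrounding context, can then serve as a training example, sidestepping the need for manually annotated gold mentions.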
## License
This project is licensed under the MIT license.
## Citation
If you find KRISSBERT useful in your research, please cite the following paper:
```bibtex
@article{krissbert,
  author     = {Sheng Zhang and Hao Cheng and Shikhar Vashishth and Cliff Wong and Jinfeng Xiao and Xiaodong Liu and Tristan Naumann and Jianfeng Gao and Hoifung Poon},
  title      = {Knowledge-Rich Self-Supervision for Biomedical Entity Linking},
  year       = {2021},
  url        = {https://arxiv.org/abs/2112.07887},
  eprinttype = {arXiv},
  eprint     = {2112.07887},
}
```
| Property | Details |
|---|---|
| Model Type | Contextual encoder for entity linking |
| Training Data | Biomedical entity names from the UMLS ontology and entity linking examples from PubMed abstracts |