🚀 BioLinkBERT-large
BioLinkBERT-large is pretrained on PubMed abstracts together with citation link information, and achieves state-of-the-art performance on biomedical NLP tasks.
🚀 Quick Start
The BioLinkBERT-large model is pretrained on PubMed abstracts along with citation link information. It was introduced in the paper LinkBERT: Pretraining Language Models with Document Links (ACL 2022). The code and data are available in this repository.
This model achieves state-of-the-art performance on several biomedical NLP benchmarks such as BLURB and MedQA-USMLE.
✨ Features
- Pretrained on a large corpus of documents while capturing the links between them (e.g. hyperlinks and citation links).
- Can be a drop-in replacement for BERT, achieving better performance in general language understanding tasks.
- Particularly effective for knowledge-intensive and cross-document tasks.
📦 Installation
The installation mainly involves the `transformers` library; you can install it via `pip install transformers` if you haven't already.
💻 Usage Examples
Basic Usage
To use the model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/BioLinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/BioLinkBERT-large')

# Encode a biomedical sentence
inputs = tokenizer("Sunitinib is a tyrosine kinase inhibitor", return_tensors="pt")
outputs = model(**inputs)

# Token-level embeddings with shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```
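As a quick sanity check, you can inspect the embedding shape. Assuming the large model follows the usual BERT-large configuration (340M parameters, as listed in the evaluation section), the hidden size should be 1024:

```python
# Sequence length depends on tokenization; a hidden size of 1024 assumes a
# BERT-large configuration (an assumption, not verified here)
print(last_hidden_states.shape)  # e.g. torch.Size([1, N, 1024])
```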
Advanced Usage
For fine-tuning, you can use this repository or follow any other BERT fine-tuning codebase; a minimal sketch is shown below.
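For illustration only (this is not the repository's official fine-tuning script), here is a minimal sequence-classification sketch using the 🤗 Trainer API. The toy dataset, label count, and hyperparameters are placeholders to replace with your own task:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "michiyasunaga/BioLinkBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is task-specific; 2 is just a placeholder for a binary task.
# The classification head is newly initialized and must be trained.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy in-memory dataset; replace with your own "text"/"label" data
train_ds = Dataset.from_dict({
    "text": ["Sunitinib is a tyrosine kinase inhibitor.",
             "Aspirin is a nonsteroidal anti-inflammatory drug."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="biolinkbert-finetuned",
    per_device_train_batch_size=8,
    learning_rate=2e-5,   # a typical BERT fine-tuning learning rate
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```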
📚 Documentation
Model description
LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing document links, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to single documents.
LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for knowledge-intensive tasks (e.g. question answering) and cross-document tasks (e.g. reading comprehension, document retrieval).
Intended uses & limitations
The model can be used by fine-tuning on a downstream task, such as question answering, sequence classification, or token classification. You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text). A sketch of loading the checkpoint behind different task heads is shown below.
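As a sketch only (these are standard 🤗 transformers auto classes, not BioLinkBERT-specific code), the same checkpoint can be loaded behind different task heads; each head is randomly initialized and needs fine-tuning before use:

```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

name = "michiyasunaga/BioLinkBERT-large"

# Extractive question answering (start/end span prediction)
qa_model = AutoModelForQuestionAnswering.from_pretrained(name)

# Sentence- or document-level classification; num_labels is task-specific
cls_model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

# Token classification, e.g. biomedical named entity recognition
ner_model = AutoModelForTokenClassification.from_pretrained(name, num_labels=5)
```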
🔧 Technical Details
LinkBERT incorporates document links into pretraining, whereas standard BERT treats each document independently. By feeding linked documents into the same language model context, it can capture knowledge that spans multiple documents, which improves performance on a range of downstream tasks. A conceptual sketch of this linked-document input is given below.
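The snippet below only illustrates the idea of placing two linked documents in one context as a segment pair; it is not the actual LinkBERT pretraining code, and the real implementation's link sampling and training objectives are more involved:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/BioLinkBERT-large")

# Two hypothetical abstracts, where doc_a cites doc_b (a citation link)
doc_a = "Sunitinib is a tyrosine kinase inhibitor used in renal cell carcinoma."
doc_b = "Tyrosine kinase inhibitors block signaling pathways involved in tumor growth."

# Conceptually, linked documents share one context: [CLS] doc_a [SEP] doc_b [SEP]
inputs = tokenizer(doc_a, doc_b, return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))
```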
📄 License
This model is licensed under the Apache-2.0 license.
Evaluation results
When fine-tuned on downstream tasks, LinkBERT achieves the following results.
Biomedical benchmarks (BLURB, MedQA, MMLU, etc.): BioLinkBERT attains new state-of-the-art results.
| Model | BLURB score | PubMedQA | BioASQ | MedQA-USMLE |
| --- | --- | --- | --- | --- |
| PubMedBERT-base | 81.10 | 55.8 | 87.5 | 38.1 |
| BioLinkBERT-base | 83.39 | 70.2 | 91.4 | 40.0 |
| BioLinkBERT-large | 84.30 | 72.2 | 94.8 | 44.6 |
| Model | MMLU-professional medicine |
| --- | --- |
| GPT-3 (175B params) | 38.7 |
| UnifiedQA (11B params) | 43.2 |
| BioLinkBERT-large (340M params) | 50.7 |
Citation
If you find LinkBERT useful in your project, please cite the following:
```bibtex
@InProceedings{yasunaga2022linkbert,
  author    = {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title     = {LinkBERT: Pretraining Language Models with Document Links},
  year      = {2022},
  booktitle = {Association for Computational Linguistics (ACL)},
}
```
Additional Information
| Property | Details |
| --- | --- |
| Model Type | BioLinkBERT-large |
| Training Data | PubMed abstracts with citation link information |
⚠️ Important Note
The model should be used in compliance with the Apache-2.0 license.
💡 Usage Tip
For better performance, fine-tuning on your specific downstream task is recommended.