# LinkBERT-large
LinkBERT-large is a pre-trained language model that leverages English Wikipedia articles together with their hyperlink information. It offers strong performance across NLP tasks, especially knowledge-intensive and cross-document ones.
## Quick Start
LinkBERT-large is pre-trained on English Wikipedia articles with hyperlink information. It was introduced in the paper LinkBERT: Pretraining Language Models with Document Links (ACL 2022). The code and data can be found in this repository.
## Features
- Document Link Awareness: LinkBERT is an improved transformer encoder (similar to BERT) that captures document links such as hyperlinks and citation links, incorporating knowledge across multiple documents.
- Versatile Application: It can serve as a drop-in replacement for BERT, performing well on general language understanding tasks and excelling at knowledge-intensive and cross-document tasks.
## Installation
LinkBERT-large is loaded through the Hugging Face Transformers library; installing `transformers` (e.g., `pip install transformers`) together with PyTorch is sufficient to run the examples below.
## Usage Examples

### Basic Usage
To use the model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModel

# Load the LinkBERT-large tokenizer and encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')

# Tokenize an example sentence and run it through the encoder
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# One contextual embedding per input token
last_hidden_states = outputs.last_hidden_state
```
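`last_hidden_state` contains one contextual embedding per input token; since LinkBERT-large follows the BERT-large architecture, each embedding is 1024-dimensional.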
### Advanced Usage
For fine-tuning, you can use the LinkBERT repository or any standard BERT fine-tuning codebase; a minimal sketch is shown below.
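As a concrete illustration, here is a minimal fine-tuning sketch using the Hugging Face `Trainer` on a GLUE-style sentence classification task. The dataset choice, hyperparameters, and output directory are placeholders rather than settings from the LinkBERT paper or repository.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# Placeholder dataset and hyperparameters -- adjust for your own task.
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-large")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

# LinkBERT uses the BERT architecture, so the standard classification head applies.
model = AutoModelForSequenceClassification.from_pretrained(
    "michiyasunaga/LinkBERT-large", num_labels=2
)

args = TrainingArguments(
    output_dir="linkbert-finetuned",   # placeholder output directory
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```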
## Documentation

### Model description
LinkBERT is a transformer encoder (BERT-like) model pre-trained on a large corpus of documents. It enhances BERT by capturing document links such as hyperlinks and citation links, integrating knowledge that spans multiple documents. Specifically, it was pre-trained by placing linked documents in the same language model context, in addition to single documents.
LinkBERT can be used as a drop-in replacement for BERT. It performs better on general language understanding tasks (e.g., text classification), and is particularly effective for knowledge-intensive tasks (e.g., question answering) and cross-document tasks (e.g., reading comprehension, document retrieval).
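As a rough illustration of the pretraining idea (not the actual pretraining code), the snippet below packs an anchor passage and a hypothetical hyperlinked passage into a single input context as a sequence pair; the example texts are made up.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-large")

# Illustrative (made-up) texts: an anchor passage and a linked passage.
anchor = "Tidal forces are gravitational effects that stretch a body toward another body."
linked = "Gravity is the attraction between objects with mass."

# Pack both passages into one context as a sequence pair; the actual pretraining
# pipeline builds such multi-document contexts at scale from hyperlink data.
inputs = tokenizer(anchor, linked, return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))  # [CLS] anchor ... [SEP] linked ... [SEP]
```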
### Intended uses & limitations
The model can be fine-tuned for downstream tasks such as question answering, sequence classification, and token classification. You can also use the raw model for feature extraction (i.e., obtaining embeddings for input text).
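For instance, task-specific heads can be attached by loading the checkpoint into the corresponding `AutoModelFor...` classes. The sketch below is illustrative: the newly initialized heads must be fine-tuned before they produce meaningful predictions, and the `num_labels` value is a placeholder.

```python
from transformers import AutoModelForQuestionAnswering, AutoModelForTokenClassification

# Attach task-specific heads to the pre-trained encoder.
# The heads are randomly initialized, so fine-tuning is required before inference.
qa_model = AutoModelForQuestionAnswering.from_pretrained("michiyasunaga/LinkBERT-large")
ner_model = AutoModelForTokenClassification.from_pretrained(
    "michiyasunaga/LinkBERT-large", num_labels=9  # placeholder label count
)
```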
## Technical Details

### Evaluation results
| Property | Details |
|----------|---------|
| Model Type | LinkBERT-large |
| Training Data | English Wikipedia articles with hyperlink information |

When fine-tuned on downstream tasks, LinkBERT achieves the following results.

General benchmarks (MRQA and GLUE):
| Model | HotpotQA (F1) | TriviaQA (F1) | SearchQA (F1) | NaturalQ (F1) | NewsQA (F1) | SQuAD (F1) | GLUE (avg score) |
|-------|---------------|---------------|---------------|---------------|-------------|------------|------------------|
| BERT-base | 76.0 | 70.3 | 74.2 | 76.5 | 65.7 | 88.7 | 79.2 |
| LinkBERT-base | 78.2 | 73.9 | 76.8 | 78.3 | 69.3 | 90.1 | 79.6 |
| BERT-large | 78.1 | 73.7 | 78.3 | 79.0 | 70.9 | 91.1 | 80.7 |
| LinkBERT-large | **80.8** | **78.2** | **80.5** | **81.0** | **72.6** | **92.7** | **81.1** |
## License
The model is released under the Apache 2.0 license.
## Citation
If you find LinkBERT useful in your project, please cite the following:
```bibtex
@InProceedings{yasunaga2022linkbert,
  author    = {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title     = {LinkBERT: Pretraining Language Models with Document Links},
  year      = {2022},
  booktitle = {Association for Computational Linguistics (ACL)},
}
```