# SciNCL
SciNCL is a pre-trained BERT language model that generates document-level embeddings of research papers. It leverages the citation graph neighborhood for contrastive learning, making it well suited for scientific document analysis.
## Quick Start
SciNCL uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with the weights of `scibert-scivocab-uncased`. The underlying citation embeddings are trained on the S2ORC citation graph.
- Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022)
- Code: https://github.com/malteos/scincl
- PubMedNCL: Working with biomedical papers? Try PubMedNCL.
## Features
- Document-level Embeddings: Generates high-quality document-level embeddings for research papers.
- Contrastive Learning: Uses the citation graph neighborhood to build contrastive training samples, enhancing the quality of the learned representations (a minimal sketch of this idea follows the list).
- Pre-trained Weights: Initialized with the weights of `scibert-scivocab-uncased` for better generalization to scientific text.
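
To make the contrastive-learning feature concrete, here is a minimal, illustrative sketch of a triplet-style objective over document embeddings. It is not the released training code: `anchor`, `positive`, and `negative` are placeholder tensors standing in for embeddings of a query paper, a citation-graph neighbor, and a non-neighbor, and the margin value is only an example.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Illustrative triplet margin loss over document embeddings."""
    # Pull the anchor towards its positive (a citation-graph neighbor) ...
    d_pos = F.pairwise_distance(anchor, positive, p=2)
    # ... and push it away from its negative (a non-neighbor), up to a margin.
    d_neg = F.pairwise_distance(anchor, negative, p=2)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Placeholder embeddings: a batch of 8 documents, 768-dim like BERT.
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
print(triplet_margin_loss(anchor, positive, negative))
```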
## Installation
The model is loaded through the `sentence-transformers` or `transformers` library, so there is no model-specific installation step. Just make sure the required libraries are installed:
```bash
pip install sentence-transformers transformers
```
## Usage Examples
### Basic Usage
#### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

# Load the SciNCL model
model = SentenceTransformer("malteos/scincl")

# Each paper is a "title [SEP] abstract" string
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

embeddings = model.encode(papers)

# Similarity between the two document embeddings (cosine similarity by default)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
```
#### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# Concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)

# Take the first token ([CLS]) as the document embedding
embeddings = result.last_hidden_state[:, 0, :]
# L2-normalize so the dot product equals cosine similarity
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

similarity = (embeddings[0] @ embeddings[1])
print(similarity.item())
```
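
Building on the examples above, the following sketch shows one way to rank a small set of candidate papers against a query paper by cosine similarity. The paper strings are illustrative examples and `util.cos_sim` is used here only for convenience; treat this as a usage sketch rather than an official recipe.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("malteos/scincl")

# Query and candidates are "title [SEP] abstract" strings (illustrative examples).
query = "BERT [SEP] We introduce a new language representation model called BERT"
candidates = [
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
    "ImageNet classification with deep convolutional neural networks [SEP] We trained a large, deep convolutional neural network",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate, printed highest first.
scores = util.cos_sim(query_emb, cand_embs)[0]
for score, paper in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.3f}  {paper[:60]}")
```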
## Documentation
### Triplet Mining Parameters

| Property | Details |
|---|---|
| `seed` | 4 |
| `triples_per_query` | 5 |
| `easy_positives_count` | 5 |
| `easy_positives_strategy` | knn |
| `easy_positives_k` | 20-25 |
| `easy_negatives_count` | 3 |
| `easy_negatives_strategy` | random_without_knn |
| `hard_negatives_count` | 2 |
| `hard_negatives_strategy` | knn |
| `hard_negatives_k` | 3998-4000 |
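
As a rough illustration of what these parameters mean, the sketch below samples easy positives, easy negatives, and hard negatives for a single query paper from a precomputed k-nearest-neighbor ranking over citation embeddings. It is a simplified reconstruction for readability, not the released mining code; `neighbors` and `all_ids` are assumed inputs.

```python
import random

def mine_candidates(query_id, neighbors, all_ids, rng=random.Random(4)):
    """Sample triplet candidates for one query, loosely following the table above.

    neighbors: paper IDs sorted by ascending citation-embedding distance
               (an assumed, precomputed kNN ranking for the query paper)
    all_ids:   all paper IDs in the corpus
    """
    # Easy positives: close kNN neighbors (ranks ~20-25).
    easy_pos = rng.sample(neighbors[20:25], k=5)
    # Easy negatives: random papers that do not appear in the kNN ranking.
    neighbor_set = set(neighbors)
    outside_knn = [p for p in all_ids if p not in neighbor_set and p != query_id]
    easy_neg = rng.sample(outside_knn, k=3)
    # Hard negatives: distant kNN neighbors (ranks ~3998-4000).
    hard_neg = rng.sample(neighbors[3998:4000], k=2)
    return easy_pos, easy_neg, hard_neg
```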
### SciDocs Results

These model weights are the ones that yielded the best results on SciDocs (`seed=4`). In the paper, the SciDocs results are reported as the mean over ten seeds.
| Model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18.0 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88.0 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82.0 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20.0 | 80.0 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |
## License
This project is licensed under the MIT license.