# SciNCL
SciNCL is a pre-trained BERT language model that generates document-level embeddings of research papers. It leverages the citation graph neighborhood for contrastive learning, making it well suited for scientific document analysis.
## Quick Start
SciNCL uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with the weights of `scibert-scivocab-uncased`. The underlying citation embeddings are trained on the S2ORC citation graph.
- Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022)
- Code: https://github.com/malteos/scincl
- PubMedNCL: Working with biomedical papers? Try PubMedNCL.
## Features
- Document-level Embeddings: Generates high-quality document-level embeddings for research papers.
- Contrastive Learning: Uses the citation graph neighborhood to build contrastive training samples, enhancing the quality of the learned representations (a minimal sketch of this idea follows the list).
- Pre-trained Weights: Initialized with the weights of `scibert-scivocab-uncased` for better generalization to scientific text.
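
To make the contrastive-learning feature concrete, here is a minimal, illustrative sketch of a triplet-style objective over document embeddings. It is not the released training code: `anchor`, `positive`, and `negative` are placeholder tensors standing in for embeddings of a query paper, a citation-graph neighbor, and a non-neighbor, and the margin value is only an example.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Illustrative triplet margin loss over document embeddings."""
    # Pull the anchor towards its positive (a citation-graph neighbor) ...
    d_pos = F.pairwise_distance(anchor, positive, p=2)
    # ... and push it away from its negative (a non-neighbor), up to a margin.
    d_neg = F.pairwise_distance(anchor, negative, p=2)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Placeholder embeddings: a batch of 8 documents, 768-dim like BERT.
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
print(triplet_margin_loss(anchor, positive, negative))
```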
## Installation
The model is loaded through the `sentence-transformers` or `transformers` library, so there is no model-specific installation step. Just make sure the required libraries are installed:
```bash
pip install sentence-transformers transformers
```
## Usage Examples
### Basic Usage
#### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

# Load the SciNCL model
model = SentenceTransformer("malteos/scincl")

# Each paper is a "title [SEP] abstract" string
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

embeddings = model.encode(papers)

# Similarity between the two document embeddings (cosine similarity by default)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
```
#### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# Concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)

# Take the first token ([CLS]) as the document embedding
embeddings = result.last_hidden_state[:, 0, :]
# L2-normalize so the dot product equals cosine similarity
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

similarity = (embeddings[0] @ embeddings[1])
print(similarity.item())
```
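
Building on the examples above, the following sketch shows one way to rank a small set of candidate papers against a query paper by cosine similarity. The paper strings are illustrative examples and `util.cos_sim` is used here only for convenience; treat this as a usage sketch rather than an official recipe.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("malteos/scincl")

# Query and candidates are "title [SEP] abstract" strings (illustrative examples).
query = "BERT [SEP] We introduce a new language representation model called BERT"
candidates = [
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
    "ImageNet classification with deep convolutional neural networks [SEP] We trained a large, deep convolutional neural network",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate, printed highest first.
scores = util.cos_sim(query_emb, cand_embs)[0]
for score, paper in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.3f}  {paper[:60]}")
```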
## Documentation
### Triplet Mining Parameters

| Property | Details |
|---|---|
| `seed` | 4 |
| `triples_per_query` | 5 |
| `easy_positives_count` | 5 |
| `easy_positives_strategy` | knn |
| `easy_positives_k` | 20-25 |
| `easy_negatives_count` | 3 |
| `easy_negatives_strategy` | random_without_knn |
| `hard_negatives_count` | 2 |
| `hard_negatives_strategy` | knn |
| `hard_negatives_k` | 3998-4000 |
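
As a rough illustration of what these parameters mean, the sketch below samples easy positives, easy negatives, and hard negatives for a single query paper from a precomputed k-nearest-neighbor ranking over citation embeddings. It is a simplified reconstruction for readability, not the released mining code; `neighbors` and `all_ids` are assumed inputs.

```python
import random

def mine_candidates(query_id, neighbors, all_ids, rng=random.Random(4)):
    """Sample triplet candidates for one query, loosely following the table above.

    neighbors: paper IDs sorted by ascending citation-embedding distance
               (an assumed, precomputed kNN ranking for the query paper)
    all_ids:   all paper IDs in the corpus
    """
    # Easy positives: close kNN neighbors (ranks ~20-25).
    easy_pos = rng.sample(neighbors[20:25], k=5)
    # Easy negatives: random papers that do not appear in the kNN ranking.
    neighbor_set = set(neighbors)
    outside_knn = [p for p in all_ids if p not in neighbor_set and p != query_id]
    easy_neg = rng.sample(outside_knn, k=3)
    # Hard negatives: distant kNN neighbors (ranks ~3998-4000).
    hard_neg = rng.sample(neighbors[3998:4000], k=2)
    return easy_pos, easy_neg, hard_neg
```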
### SciDocs Results

These model weights are the ones that yielded the best results on SciDocs (`seed=4`). In the paper, the SciDocs results are reported as the mean over ten seeds.
| Model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18.0 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88.0 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82.0 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20.0 | 80.0 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |
## License
This project is licensed under the MIT license.