🚀 SciNCL
SciNCL is a pretrained BERT language model that generates document-level embeddings for research papers. It uses citation graph neighborhoods to generate samples for contrastive learning. Prior to contrastive training, the model is initialized with weights from `scibert-scivocab-uncased`. The underlying citation embeddings are trained on the S2ORC citation graph.
Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022).
Code: https://github.com/malteos/scincl
PubMedNCL: If you work with biomedical papers, consider trying PubMedNCL.
🚀 Quick Start
✨ Key Features
- Based on a pretrained BERT model; generates document-level embeddings for research papers.
- Uses citation graph neighborhoods to generate samples for contrastive learning (see the sketch after this list).
- Initialized from pretrained `scibert-scivocab-uncased` weights.
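The exact training objective is described in the paper; as a rough, non-authoritative sketch, the contrastive step can be pictured as a triplet margin loss over anchor/positive/negative paper embeddings, where positives and negatives are sampled from the citation-embedding neighborhood. The function name and margin value below are illustrative assumptions, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss over batches of document embeddings
    (margin value is an illustrative assumption)."""
    d_pos = F.pairwise_distance(anchor, positive)  # anchor-positive L2 distance
    d_neg = F.pairwise_distance(anchor, negative)  # anchor-negative L2 distance
    # Push positives closer than negatives by at least `margin`.
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```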
📦 Installation
The usage examples below only require the `sentence-transformers` and `transformers` packages, e.g. `pip install sentence-transformers transformers torch`.
💻 Usage Examples
Basic usage
Load the pretrained model with the Sentence Transformers library:
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("malteos/scincl")

# Each input is the paper's title and abstract joined by the [SEP] token.
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

embeddings = model.encode(papers)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
```
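`encode` and `similarity` also operate on whole batches, which makes ranking a corpus against a query straightforward. A minimal sketch reusing the objects above (the query string is made up for illustration):

```python
# Rank the example papers against a new query (query text is illustrative).
query_embedding = model.encode("Transformer architectures for sequence modeling")
scores = model.similarity(query_embedding, embeddings)  # shape: (1, 2)
best = scores.argmax().item()
print(papers[best], scores[0, best].item())
```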
Advanced usage
Load the pretrained model with the Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# Concatenate each title and abstract with the tokenizer's [SEP] token.
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)

# Take the [CLS] token embedding and L2-normalize it.
embeddings = result.last_hidden_state[:, 0, :]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

# Dot product of normalized embeddings = cosine similarity.
similarity = embeddings[0] @ embeddings[1]
print(similarity.item())
```
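With normalized embeddings, retrieval over a candidate set reduces to a matrix-vector product plus `torch.topk`. A minimal sketch reusing `embeddings` from above; the helper name and `k` are illustrative, not part of the SciNCL API:

```python
def top_k_similar(query_emb, candidate_embs, k=5):
    # Cosine similarity via dot product (embeddings are already L2-normalized).
    scores = candidate_embs @ query_emb
    k = min(k, candidate_embs.shape[0])
    values, indices = torch.topk(scores, k=k)
    return list(zip(indices.tolist(), values.tolist()))

# The best match for a paper's own embedding is the paper itself.
print(top_k_similar(embeddings[0], embeddings, k=2))
```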
📚 Documentation
Triple mining parameters
| Parameter | Value |
|---|---|
| seed | 4 |
| triples_per_query | 5 |
| easy_positives_count | 5 |
| easy_positives_strategy | knn |
| easy_positives_k | 20-25 |
| easy_negatives_count | 3 |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count | 2 |
| hard_negatives_strategy | knn |
| hard_negatives_k | 3998-4000 |
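To make the table concrete, here is a minimal sketch of how triples could be sampled from a kNN index over citation embeddings using the parameters above. The input format and function name are assumptions for illustration; see the repository for the actual mining code.

```python
import random

def mine_triples(query_id, knn_ids, all_ids):
    """Sample (query, positive, negative) triples for one query paper.

    `knn_ids`: paper ids sorted by ascending citation-embedding distance
    to the query (illustrative input format)."""
    # Easy positives: nearest neighbors at ranks 20-25 (easy_positives_strategy=knn).
    positives = list(knn_ids[20:25])                 # easy_positives_count=5
    # Hard negatives: far-away neighbors at ranks 3998-4000 (hard_negatives_strategy=knn).
    hard_negatives = list(knn_ids[3998:4000])        # hard_negatives_count=2
    # Easy negatives: random papers outside the kNN list
    # (easy_negatives_strategy=random_without_knn).
    knn_set = set(knn_ids) | {query_id}
    pool = [i for i in all_ids if i not in knn_set]
    easy_negatives = random.sample(pool, 3)          # easy_negatives_count=3
    # triples_per_query=5: pair each positive with one negative.
    negatives = easy_negatives + hard_negatives
    return [(query_id, p, n) for p, n in zip(positives, negatives)]
```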
SciDocs results
These model weights are the ones that performed best on SciDocs (seed=4). The SciDocs results reported in the paper are averaged over ten random seeds.
| Model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20 | 80 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |
Additional evaluation results can be found in the paper.
📄 License
This project is released under the MIT license.