🚀 SciNCL
SciNCL is a pre-trained BERT language model that generates document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.
Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022).
Code: https://github.com/malteos/scincl
PubMedNCL: Working with biomedical papers? Try PubMedNCL.
🚀 Quick Start
✨ Key Features
- A pre-trained BERT model that generates document-level embeddings of research papers.
- Uses citation graph neighborhoods for contrastive learning.
- Initialized with pre-trained scibert-scivocab-uncased weights.
📦 Installation
The original documentation does not cover installation steps. The examples below assume the standard PyPI packages are already available (e.g. via `pip install sentence-transformers` for the basic example and `pip install transformers torch` for the advanced one).
💻 Usage Examples
Basic Usage
Load the pre-trained model with the Sentence Transformers library:
```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained model from the Hugging Face Hub
model = SentenceTransformer("malteos/scincl")

# Each paper is represented as "title [SEP] abstract"
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

embeddings = model.encode(papers)

# Cosine similarity between the two paper embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
```
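`model.similarity` also accepts whole batches of embeddings, so all pairs can be scored in one call. A short usage sketch (not part of the original card):

```python
# Pairwise similarity matrix for all encoded papers (2x2 here);
# diagonal entries are ~1.0 since each paper matches itself.
scores = model.similarity(embeddings, embeddings)
print(scores)
```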
Advanced Usage
Load the pre-trained model with the Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# Concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)

# Take the first token ([CLS]) of the last hidden state as the document embedding,
# then L2-normalize so the dot product below equals cosine similarity
embeddings = result.last_hidden_state[:, 0, :]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

similarity = embeddings[0] @ embeddings[1]
print(similarity.item())
```
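For larger collections you would typically batch the inputs and disable gradient tracking. A minimal sketch under those assumptions (the helper name and batch size are illustrative, not from the original card):

```python
def embed_papers(texts, tokenizer, model, batch_size=8):
    # Embed "title [SEP] abstract" strings in batches (illustrative helper).
    chunks = []
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                              return_tensors="pt", max_length=512)
            out = model(**batch)
            chunks.append(out.last_hidden_state[:, 0, :])  # [CLS] embeddings
    return torch.cat(chunks)

vectors = embed_papers(title_abs, tokenizer, model)
print(vectors.shape)  # (num_papers, hidden_size)
```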
📚 Documentation
Triple Mining Parameters
| Parameter | Value |
|-----------|-------|
| seed | 4 |
| triples_per_query | 5 |
| easy_positives_count | 5 |
| easy_positives_strategy | knn |
| easy_positives_k | 20-25 |
| easy_negatives_count | 3 |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count | 2 |
| hard_negatives_strategy | knn |
| hard_negatives_k | 3998-4000 |
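These settings describe how training triples are sampled from a kNN index over the citation embeddings: easy positives come from a close neighbor band (k = 20-25), hard negatives from a distant band (k = 3998-4000), and easy negatives are drawn at random outside the kNN list. A simplified, hypothetical sketch of that sampling scheme (the function and input format are illustrative, not the authors' implementation):

```python
import numpy as np

def mine_triples(knn_indices, num_docs, rng, triples_per_query=5):
    # knn_indices[i]: neighbor ids of document i in citation-embedding space,
    # sorted by ascending distance (hypothetical input format).
    triples = []
    for query in range(num_docs):
        neighbors = knn_indices[query]
        # Easy positives: close kNN band (easy_positives_k = 20-25).
        positives = rng.choice(neighbors[20:25], size=triples_per_query)
        # Hard negatives: distant kNN band (hard_negatives_k = 3998-4000).
        hard_negs = list(rng.choice(neighbors[3998:4000], size=2))
        # Easy negatives: random docs outside the kNN list (random_without_knn).
        easy_negs = []
        while len(easy_negs) < 3:
            cand = int(rng.integers(num_docs))
            if cand != query and cand not in neighbors:
                easy_negs.append(cand)
        # Pair each positive with one negative: 5 triples per query.
        for pos, neg in zip(positives, hard_negs + easy_negs):
            triples.append((query, int(pos), int(neg)))
    return triples

rng = np.random.default_rng(seed=4)  # seed = 4, as in the table above
```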
SciDocs Results
These model weights are the ones that achieved the best results on SciDocs (seed=4). In the paper, we report SciDocs results as the average over ten random seeds.
| Model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|-------|--------|---------|-------------|--------------|-------------|--------------|----------|-----------|------------|-------------|-------------|------------|-----|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18.0 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88.0 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82.0 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20.0 | 80.0 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |
Additional evaluation results are available in the paper.
📄 License
This project is licensed under the MIT License.