🚀 SciNCL
SciNCL is a pre-trained BERT language model that generates document-level embeddings of research papers. It uses the citation graph neighborhood to generate samples for contrastive learning. Prior to the contrastive training, the model is initialized with weights from scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.
Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022).
Code: https://github.com/malteos/scincl
PubMedNCL: Working with biomedical papers? Try PubMedNCL.
🚀 Quick Start
✨ Key Features
- A pre-trained BERT model that generates document-level embeddings of research papers.
- Uses citation graph neighborhoods for contrastive learning.
- Initialized with pre-trained scibert-scivocab-uncased weights.
📦 Installation
The original documentation does not cover installation steps. The examples below assume the standard PyPI packages are already available (e.g. via `pip install sentence-transformers` for the basic example and `pip install transformers torch` for the advanced one).
💻 Usage Examples
Basic Usage
Load the pre-trained model with the Sentence Transformers library:
```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained model from the Hugging Face Hub
model = SentenceTransformer("malteos/scincl")

# Each paper is represented as "title [SEP] abstract"
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

embeddings = model.encode(papers)

# Cosine similarity between the two paper embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
```
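`model.similarity` also accepts whole batches of embeddings, so all pairs can be scored in one call. A short usage sketch (not part of the original card):

```python
# Pairwise similarity matrix for all encoded papers (2x2 here);
# diagonal entries are ~1.0 since each paper matches itself.
scores = model.similarity(embeddings, embeddings)
print(scores)
```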
Advanced Usage
Load the pre-trained model with the Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# Concatenate title and abstract with the [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)

# Take the first token ([CLS]) of the last hidden state as the document embedding,
# then L2-normalize so the dot product below equals cosine similarity
embeddings = result.last_hidden_state[:, 0, :]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

similarity = embeddings[0] @ embeddings[1]
print(similarity.item())
```
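For larger collections you would typically batch the inputs and disable gradient tracking. A minimal sketch under those assumptions (the helper name and batch size are illustrative, not from the original card):

```python
def embed_papers(texts, tokenizer, model, batch_size=8):
    # Embed "title [SEP] abstract" strings in batches (illustrative helper).
    chunks = []
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                              return_tensors="pt", max_length=512)
            out = model(**batch)
            chunks.append(out.last_hidden_state[:, 0, :])  # [CLS] embeddings
    return torch.cat(chunks)

vectors = embed_papers(title_abs, tokenizer, model)
print(vectors.shape)  # (num_papers, hidden_size)
```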
📚 Documentation
Triple Mining Parameters
| Parameter | Value |
|-----------|-------|
| seed | 4 |
| triples_per_query | 5 |
| easy_positives_count | 5 |
| easy_positives_strategy | knn |
| easy_positives_k | 20-25 |
| easy_negatives_count | 3 |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count | 2 |
| hard_negatives_strategy | knn |
| hard_negatives_k | 3998-4000 |
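These settings describe how training triples are sampled from a kNN index over the citation embeddings: easy positives come from a close neighbor band (k = 20-25), hard negatives from a distant band (k = 3998-4000), and easy negatives are drawn at random outside the kNN list. A simplified, hypothetical sketch of that sampling scheme (the function and input format are illustrative, not the authors' implementation):

```python
import numpy as np

def mine_triples(knn_indices, num_docs, rng, triples_per_query=5):
    # knn_indices[i]: neighbor ids of document i in citation-embedding space,
    # sorted by ascending distance (hypothetical input format).
    triples = []
    for query in range(num_docs):
        neighbors = knn_indices[query]
        # Easy positives: close kNN band (easy_positives_k = 20-25).
        positives = rng.choice(neighbors[20:25], size=triples_per_query)
        # Hard negatives: distant kNN band (hard_negatives_k = 3998-4000).
        hard_negs = list(rng.choice(neighbors[3998:4000], size=2))
        # Easy negatives: random docs outside the kNN list (random_without_knn).
        easy_negs = []
        while len(easy_negs) < 3:
            cand = int(rng.integers(num_docs))
            if cand != query and cand not in neighbors:
                easy_negs.append(cand)
        # Pair each positive with one negative: 5 triples per query.
        for pos, neg in zip(positives, hard_negs + easy_negs):
            triples.append((query, int(pos), int(neg)))
    return triples

rng = np.random.default_rng(seed=4)  # seed = 4, as in the table above
```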
SciDocs Results
These model weights are the ones that achieved the best results on SciDocs (seed=4). In the paper, we report SciDocs results as the average over ten random seeds.
| Model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|-------|--------|---------|-------------|--------------|-------------|--------------|----------|-----------|------------|-------------|-------------|------------|-----|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18.0 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88.0 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82.0 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20.0 | 80.0 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |
Additional evaluation results are available in the paper.
📄 License
This project is licensed under the MIT License.