🚀 SciNCL
SciNCL is a pretrained BERT language model that generates document-level embeddings for research papers. It uses citation graph neighborhoods to generate samples for contrastive learning. Prior to contrastive training, the model is initialized with weights from `scibert-scivocab-uncased`. The underlying citation embeddings are trained on the S2ORC citation graph.
Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022).
Code: https://github.com/malteos/scincl
PubMedNCL: If you work with biomedical papers, consider trying PubMedNCL.
🚀 Quick Start
✨ Key Features
- Based on a pretrained BERT model; generates document-level embeddings for research papers.
- Uses citation graph neighborhoods to generate samples for contrastive learning (see the sketch after this list).
- Initialized from pretrained `scibert-scivocab-uncased` weights.
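The exact training objective is described in the paper; as a rough, non-authoritative sketch, the contrastive step can be pictured as a triplet margin loss over anchor/positive/negative paper embeddings, where positives and negatives are sampled from the citation-embedding neighborhood. The function name and margin value below are illustrative assumptions, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss over batches of document embeddings
    (margin value is an illustrative assumption)."""
    d_pos = F.pairwise_distance(anchor, positive)  # anchor-positive L2 distance
    d_neg = F.pairwise_distance(anchor, negative)  # anchor-negative L2 distance
    # Push positives closer than negatives by at least `margin`.
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```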
📦 Installation
The usage examples below only require the `sentence-transformers` and `transformers` packages, e.g. `pip install sentence-transformers transformers torch`.
💻 Usage Examples
Basic usage
Load the pretrained model with the Sentence Transformers library:
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("malteos/scincl")

# Each input is the paper's title and abstract joined by the [SEP] token.
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

embeddings = model.encode(papers)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
```
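`encode` and `similarity` also operate on whole batches, which makes ranking a corpus against a query straightforward. A minimal sketch reusing the objects above (the query string is made up for illustration):

```python
# Rank the example papers against a new query (query text is illustrative).
query_embedding = model.encode("Transformer architectures for sequence modeling")
scores = model.similarity(query_embedding, embeddings)  # shape: (1, 2)
best = scores.argmax().item()
print(papers[best], scores[0, best].item())
```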
Advanced usage
Load the pretrained model with the Transformers library:
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# Concatenate each title and abstract with the tokenizer's [SEP] token.
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = model(**inputs)

# Take the [CLS] token embedding and L2-normalize it.
embeddings = result.last_hidden_state[:, 0, :]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

# Dot product of normalized embeddings = cosine similarity.
similarity = embeddings[0] @ embeddings[1]
print(similarity.item())
```
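With normalized embeddings, retrieval over a candidate set reduces to a matrix-vector product plus `torch.topk`. A minimal sketch reusing `embeddings` from above; the helper name and `k` are illustrative, not part of the SciNCL API:

```python
def top_k_similar(query_emb, candidate_embs, k=5):
    # Cosine similarity via dot product (embeddings are already L2-normalized).
    scores = candidate_embs @ query_emb
    k = min(k, candidate_embs.shape[0])
    values, indices = torch.topk(scores, k=k)
    return list(zip(indices.tolist(), values.tolist()))

# The best match for a paper's own embedding is the paper itself.
print(top_k_similar(embeddings[0], embeddings, k=2))
```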
📚 Documentation
Triple mining parameters
| Parameter | Value |
|---|---|
| seed | 4 |
| triples_per_query | 5 |
| easy_positives_count | 5 |
| easy_positives_strategy | knn |
| easy_positives_k | 20-25 |
| easy_negatives_count | 3 |
| easy_negatives_strategy | random_without_knn |
| hard_negatives_count | 2 |
| hard_negatives_strategy | knn |
| hard_negatives_k | 3998-4000 |
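To make the table concrete, here is a minimal sketch of how triples could be sampled from a kNN index over citation embeddings using the parameters above. The input format and function name are assumptions for illustration; see the repository for the actual mining code.

```python
import random

def mine_triples(query_id, knn_ids, all_ids):
    """Sample (query, positive, negative) triples for one query paper.

    `knn_ids`: paper ids sorted by ascending citation-embedding distance
    to the query (illustrative input format)."""
    # Easy positives: nearest neighbors at ranks 20-25 (easy_positives_strategy=knn).
    positives = list(knn_ids[20:25])                 # easy_positives_count=5
    # Hard negatives: far-away neighbors at ranks 3998-4000 (hard_negatives_strategy=knn).
    hard_negatives = list(knn_ids[3998:4000])        # hard_negatives_count=2
    # Easy negatives: random papers outside the kNN list
    # (easy_negatives_strategy=random_without_knn).
    knn_set = set(knn_ids) | {query_id}
    pool = [i for i in all_ids if i not in knn_set]
    easy_negatives = random.sample(pool, 3)          # easy_negatives_count=3
    # triples_per_query=5: pair each positive with one negative.
    negatives = easy_negatives + hard_negatives
    return [(query_id, p, n) for p, n in zip(positives, negatives)]
```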
SciDocs results
These model weights are the ones that performed best on SciDocs (seed=4). The SciDocs results reported in the paper are averaged over ten random seeds.
| Model | mag-f1 | mesh-f1 | co-view-map | co-view-ndcg | co-read-map | co-read-ndcg | cite-map | cite-ndcg | cocite-map | cocite-ndcg | recomm-ndcg | recomm-P@1 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc2Vec | 66.2 | 69.2 | 67.8 | 82.9 | 64.9 | 81.6 | 65.3 | 82.2 | 67.1 | 83.4 | 51.7 | 16.9 | 66.6 |
| fasttext-sum | 78.1 | 84.1 | 76.5 | 87.9 | 75.3 | 87.4 | 74.6 | 88.1 | 77.8 | 89.6 | 52.5 | 18 | 74.1 |
| SGC | 76.8 | 82.7 | 77.2 | 88 | 75.7 | 87.5 | 91.6 | 96.2 | 84.1 | 92.5 | 52.7 | 18.2 | 76.9 |
| SciBERT | 79.7 | 80.7 | 50.7 | 73.1 | 47.7 | 71.1 | 48.3 | 71.7 | 49.7 | 72.6 | 52.1 | 17.9 | 59.6 |
| SPECTER | 82 | 86.4 | 83.6 | 91.5 | 84.5 | 92.4 | 88.3 | 94.9 | 88.1 | 94.8 | 53.9 | 20 | 80 |
| SciNCL (10 seeds) | 81.4 | 88.7 | 85.3 | 92.3 | 87.5 | 93.9 | 93.6 | 97.3 | 91.6 | 96.4 | 53.9 | 19.3 | 81.8 |
| SciNCL (seed=4) | 81.2 | 89.0 | 85.3 | 92.2 | 87.7 | 94.0 | 93.6 | 97.4 | 91.7 | 96.5 | 54.3 | 19.6 | 81.9 |
Additional evaluation results can be found in the paper.
📄 License
This project is released under the MIT license.