SciNCL: an open-source model for generating document-level embeddings of research papers


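Developed by malteos

SciNCL is a pre-trained BERT language model for generating document-level embeddings of research papers. It is trained with contrastive learning on samples drawn from citation graph neighborhoods.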

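Text embedding

Transformers

Language: English | Open-source license: MIT | #ScientificLiteratureEmbedding #CitationContrastiveLearning #DocumentLevelRepresentation

Downloads: 6,744

Release date: 3/2/2022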

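Model overview

This model specializes in generating embeddings for scientific literature. Contrastive learning optimizes its document-level semantic representations, which makes it suitable for similarity computation between academic papers and for recommendation systems.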

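Model features

Citation-graph-enhanced training

Contrastive learning samples are generated from neighborhood relations in the S2ORC citation graph, which improves the quality of the document representations.

Optimized for the scientific domain

Designed for scientific literature; it performs strongly on the SciDocs benchmark (see the SciDocs results table).

Dual-text encoding

Supports joint encoding of title and abstract (joined with the [SEP] token).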

Model capabilities

Embedding generation for scientific literature

Document similarity computation

Academic paper recommendation
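
The similarity and recommendation capabilities listed above typically reduce to ranking candidate papers by cosine similarity to a query embedding. The following is a minimal sketch in plain Python. The three-dimensional vectors are made-up stand-ins for real SciNCL embeddings, and the function names `cosine` and `rank_by_cosine` are hypothetical helpers, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, candidates):
    """Return (paper_id, score) pairs sorted from most to least similar."""
    scored = [(pid, cosine(query, vec)) for pid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (illustrative values only).
query = [1.0, 0.0, 0.0]
candidates = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.5, 0.5, 0.0],
}
ranking = rank_by_cosine(query, candidates)
print([pid for pid, _ in ranking])
# => ['paper_a', 'paper_c', 'paper_b']
```

In practice the vectors would come from the model, and a recommender would return the top few entries of the ranking.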

Use cases

Academic research

🚀 SciNCL

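SciNCL is a pre-trained BERT language model for generating document-level embeddings of research papers. It uses citation graph neighborhoods to generate samples for contrastive learning. Before contrastive training, the model is initialized with the weights of scibert-scivocab-uncased. The underlying citation embeddings are trained on the S2ORC citation graph.

Paper: Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper).

Code: https://github.com/malteos/scincl

PubMedNCL: if you work with biomedical papers, you may want to try PubMedNCL.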

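🚀 Quick start

✨ Main features

Generates document-level embeddings of research papers, building on a pre-trained BERT model.
Performs contrastive learning using citation graph neighborhoods.
Initialized from the pre-trained scibert-scivocab-uncased weights.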
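
Contrastive training on citation neighborhoods is commonly framed over triplets: a query paper, a positive (a close neighbor in the citation embedding space), and a negative. The sketch below illustrates a triplet margin loss with Euclidean distance in plain Python. It shows the general technique only and is not the authors' training code; the margin of 1.0 and the toy vectors are assumptions chosen for readability.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(query, positive, negative, margin=1.0):
    """max(0, d(query, positive) - d(query, negative) + margin)."""
    gap = euclidean(query, positive) - euclidean(query, negative)
    return max(0.0, gap + margin)

query = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1 from the query
far_negative = [3.0, 4.0]    # distance 5 from the query
near_negative = [0.0, 1.5]   # distance 1.5 from the query

# The far negative is already separated by more than the margin: no loss.
print(triplet_margin_loss(query, positive, far_negative))   # => 0.0
# The near negative violates the margin, so it contributes a loss.
print(triplet_margin_loss(query, positive, near_negative))  # => 0.5
```

The second case shows why negatives that sit close to the query are informative: only they produce a non-zero training signal.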

💻 Usage examples

Basic usage

Load the pre-trained model with the Sentence Transformers library:

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("malteos/scincl")

# Concatenate the title and abstract with the [SEP] token
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]
# Inference
embeddings = model.encode(papers)

# Compute the (cosine) similarity between embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
# => 0.8440517783164978

Advanced usage

Load the pre-trained model with the Transformers library:

import torch
from transformers import AutoTokenizer, AutoModel

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
          {'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]

# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]

# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# inference
result = model(**inputs)

# take the first token ([CLS] token) in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]

# calculate the similarity
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
similarity = embeddings[0] @ embeddings[1]  # dot product of two unit vectors = cosine similarity
print(similarity.item())
# => 0.8440518379211426

📚 Documentation

Triplet mining parameters

Setting	Value
seed	4
triples_per_query	5
easy_positives_count	5
easy_positives_strategy	5
easy_positives_k	20-25
easy_negatives_count	3
easy_negatives_strategy	random_without_knn
hard_negatives_count	2
hard_negatives_strategy	knn
hard_negatives_k	3998-4000
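
One plausible reading of these settings is sketched below: for each query paper, positives are taken from ranks 21 to 25 of its nearest neighbors in the citation embedding space, hard negatives from ranks 3999 to 4000, and easy negatives are sampled at random from papers outside the retrieved neighbors. This interpretation is an assumption based on the parameter names and values, not a copy of the authors' code. The function `mine_triples` and the toy corpus are hypothetical.

```python
import random

# Values copied from the table above; k ranges are stored as (start, end).
CONFIG = {
    "seed": 4,
    "triples_per_query": 5,
    "easy_positives_k": (20, 25),
    "easy_negatives_count": 3,
    "hard_negatives_count": 2,
    "hard_negatives_k": (3998, 4000),
}

def mine_triples(query_id, ranked_neighbors, corpus_ids, cfg, rng):
    """Build (query, positive, negative) triples for one query paper.

    ranked_neighbors: paper ids ordered from nearest to farthest in the
    citation embedding space (assumed to be precomputed elsewhere).
    """
    lo, hi = cfg["easy_positives_k"]
    positives = ranked_neighbors[lo:hi]        # ranks lo+1 .. hi
    lo, hi = cfg["hard_negatives_k"]
    hard_negatives = ranked_neighbors[lo:hi]   # far end of the ranked list
    # Easy negatives: random papers that are not among the retrieved neighbors.
    excluded = set(ranked_neighbors) | {query_id}
    pool = [pid for pid in corpus_ids if pid not in excluded]
    easy_negatives = rng.sample(pool, cfg["easy_negatives_count"])
    negatives = easy_negatives + hard_negatives
    triples = [(query_id, p, n) for p, n in zip(positives, negatives)]
    return triples[: cfg["triples_per_query"]]

# Toy setup: 5000 paper ids, with ids 1..4000 as the ranked neighbors of paper 0.
rng = random.Random(CONFIG["seed"])
corpus_ids = list(range(5000))
ranked_neighbors = list(range(1, 4001))
triples = mine_triples(0, ranked_neighbors, corpus_ids, CONFIG, rng)
print(len(triples))  # => 5
```

Under this reading the counts line up with the table: five positives paired with three easy and two hard negatives give five triples per query.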

SciDocs results

These model weights are the ones that achieved the best results on SciDocs (seed = 4). The paper reports SciDocs results as the mean over ten random seeds.

Model	mag-f1	mesh-f1	co-view-map	co-view-ndcg	co-read-map	co-read-ndcg	cite-map	cite-ndcg	cocite-map	cocite-ndcg	recomm-ndcg	recomm-P@1	Avg.
Doc2Vec	66.2	69.2	67.8	82.9	64.9	81.6	65.3	82.2	67.1	83.4	51.7	16.9	66.6
fasttext-sum	78.1	84.1	76.5	87.9	75.3	87.4	74.6	88.1	77.8	89.6	52.5	18	74.1
SGC	76.8	82.7	77.2	88	75.7	87.5	91.6	96.2	84.1	92.5	52.7	18.2	76.9
SciBERT	79.7	80.7	50.7	73.1	47.7	71.1	48.3	71.7	49.7	72.6	52.1	17.9	59.6
SPECTER	82	86.4	83.6	91.5	84.5	92.4	88.3	94.9	88.1	94.8	53.9	20	80
SciNCL (10 seeds)	81.4	88.7	85.3	92.3	87.5	93.9	93.6	97.3	91.6	96.4	53.9	19.3	81.8
SciNCL (seed = 4)	81.2	89.0	85.3	92.2	87.7	94.0	93.6	97.4	91.7	96.5	54.3	19.6	81.9
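
The final column of the table appears to be the unweighted mean of the twelve metric columns. That is an inference from the numbers rather than something the text states. A quick check for two of the rows:

```python
# Metric values copied from the "SciNCL (seed = 4)" and "SPECTER" rows above.
rows = {
    "SciNCL (seed = 4)": [81.2, 89.0, 85.3, 92.2, 87.7, 94.0,
                          93.6, 97.4, 91.7, 96.5, 54.3, 19.6],
    "SPECTER": [82.0, 86.4, 83.6, 91.5, 84.5, 92.4,
                88.3, 94.9, 88.1, 94.8, 53.9, 20.0],
}

# Unweighted mean over the twelve metrics, rounded to one decimal place.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in rows.items()}
print(averages)
# => {'SciNCL (seed = 4)': 81.9, 'SPECTER': 80.0}
```

Both values match the table, which supports the reading that every metric carries equal weight in the overall score.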