declutr-sci-base: an open-source sentence encoder for scientific text, trained on over 2 million papers

Declutr Sci Base

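Developed by johngiorgi

A sentence encoder for scientific text, built on SciBERT and trained with self-supervised learning on over 2 million scientific papers

Text embedding | English | License: Apache-2.0 | #ScientificTextEmbedding #UnsupervisedContrastiveLearning #PaperSemanticMatching

Downloads: 50

Released: 3/2/2022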

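Model overview

This model is a sentence encoder optimized for scientific text. It converts a sentence into a high-dimensional vector representation that can be used for tasks such as computing sentence similarity.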

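Model features

Optimized for scientific text

Pretrained specifically on scientific literature and targeted at scientific-domain text

Self-supervised learning

Uses the DeCLUTR self-supervised training strategy, which requires no labeled data

Sentence-level embeddings

Encodes an entire sentence into a fixed-length vector representation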

Model capabilities

Sentence embedding

Semantic similarity computation

Feature extraction from scientific text

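Use cases

Academic research

Literature retrieval

Find related scientific literature by semantic similarity

Aims to improve retrieval accuracy and relevance

Paper recommendation

Recommend related research papers based on content similarity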
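Both retrieval and recommendation come down to ranking candidate texts by how similar their embeddings are to a query embedding. The following is a minimal sketch of that ranking step, not the author's implementation. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real sentence embeddings so that the example runs without downloading the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings would come from the encoder.
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 1.1],     # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],     # orthogonal to the query
    [-1.0, 0.0, -1.0],   # opposite direction
]
ranking = rank_by_similarity(query, docs, top_k=2)
print(ranking)
```

With real data, `query` and `docs` would be replaced by vectors produced by the encoder for the query sentence and for each candidate abstract or sentence.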

Text analysis

Scientific text clustering

Group similar scientific paper abstracts together
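One simple way to group abstracts by their embeddings is greedy threshold grouping, sketched below. This is an illustrative assumption rather than a method described by the model's author: the function `group_by_threshold`, the `threshold` value, and the toy 2-dimensional vectors are all hypothetical. A dedicated clustering algorithm would usually be preferable on real data.

```python
import math


def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_threshold(vectors, threshold=0.8):
    """Greedy grouping of vectors by cosine similarity.

    Each vector joins the first existing group whose first member (used as
    the group's representative) is at least `threshold` similar to it;
    otherwise it starts a new group. Returns a list of index lists.
    """
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if _cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough.
            groups.append([i])
    return groups


# Toy vectors (assumption) standing in for abstract embeddings.
abstract_vecs = [
    [1.0, 0.1],
    [0.1, 1.0],
    [0.9, 0.2],
    [0.0, 1.0],
]
groups = group_by_threshold(abstract_vecs, threshold=0.8)
print(groups)
```

The result depends on input order and on the chosen threshold, which is the main limitation of this greedy approach.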

🚀 DeCLUTR-sci-base

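DeCLUTR-sci-base is a sentence similarity model. It was pretrained on scientific literature and produces sentence embeddings for scientific text, which can be used for tasks such as computing semantic similarity.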

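🚀 Quick start

Model description

This model is based on allenai/scibert_scivocab_uncased and underwent extended pretraining on over 2 million scientific papers from S2ORC, using the self-supervised training strategy presented in DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.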

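Intended uses and limitations

The model is intended to be used as a sentence encoder, similar to Google's Universal Sentence Encoder or Sentence Transformers. It is particularly suitable for scientific text.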

How to use

Please see our repo for full details. A simple example is shown below.

💻 Usage examples

Basic usage

With SentenceTransformers

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("johngiorgi/declutr-sci-base")

# Prepare some text to embed
texts = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]

# Embed the text
embeddings = model.encode(texts)

# Compute semantic similarity as 1 minus the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])

With 🤗 Transformers

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Prepare some text to embed
text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text (no gradients are needed for inference)
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean-pool the token-level embeddings to get sentence-level embeddings,
# using the attention mask so that padding tokens are ignored
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute semantic similarity as 1 minus the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
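The mean-pooling expression above can be hard to read at a glance, so here is a small worked example of the same arithmetic for a single sentence. It uses plain Python lists instead of tensors. The function name `masked_mean_pool` and the toy numbers are assumptions made for illustration only.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions.

    token_embeddings: list of token vectors (one list of floats per token)
    attention_mask:   list of 0/1 ints, 1 for real tokens and 0 for padding
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for j in range(dim):
            # Multiplying by the mask zeroes out padded positions.
            summed[j] += vec[j] * keep
    # Guard against division by zero, mirroring the clamp in the snippet above.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


tokens = [
    [1.0, 2.0],
    [3.0, 4.0],
    [100.0, 100.0],  # padding position; must not affect the result
]
mask = [1, 1, 0]
sentence_vec = masked_mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```

Only the two real tokens contribute, so the result is their average. Without the mask, the padding vector would dominate the mean.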

BibTeX entry and citation info

@inproceedings{giorgi-etal-2021-declutr,
    title        = {{D}e{CLUTR}: Deep Contrastive Learning for Unsupervised Textual Representations},
    author       = {Giorgi, John  and Nitski, Osvald  and Wang, Bo  and Bader, Gary},
    year         = 2021,
    month        = aug,
    booktitle    = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    publisher    = {Association for Computational Linguistics},
    address      = {Online},
    pages        = {879--895},
    doi          = {10.18653/v1/2021.acl-long.72},
    url          = {https://aclanthology.org/2021.acl-long.72}
}