declutr-sci-base: an open-source sentence encoder for scientific text, trained on over 2 million papers


Declutr Sci Base

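Developed by johngiorgi

A SciBERT-based sentence encoder for scientific text, trained with self-supervised learning on over 2 million scientific papers

Text embedding | English | License: Apache-2.0 | #Scientific text embedding #Unsupervised contrastive learning #Semantic paper matching

Downloads: 50

Released: 3/2/2022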

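Model Overview

This model is a sentence encoder optimized for scientific text. It converts sentences into high-dimensional vector representations for tasks such as computing sentence similarity.

Model Features

Optimized for scientific text

Pretrained specifically on scientific literature and performs well on scientific-domain text

Self-supervised learning

Trained with the DeCLUTR self-supervised strategy, which requires no labeled data

Sentence-level embeddings

Encodes an entire sentence into a fixed-length vector representation

Model Capabilities

Sentence embedding

Semantic similarity computation

Feature extraction from scientific text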
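To make "semantic similarity computation" concrete, here is a minimal sketch of cosine similarity, the measure used in the usage examples on this page. It is written in plain Python for illustration only. The helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions; real embeddings from the model are much longer.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings (made-up values).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # points in the same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

sim_ab = cosine_similarity(a, b)  # close to 1: same direction
sim_ac = cosine_similarity(a, c)  # close to 0: unrelated directions
```

A score near 1 means the two vectors point in nearly the same direction, which is read as high semantic similarity; a score near 0 means they are unrelated.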

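Use Cases

Academic research

Literature retrieval

Find related scientific literature by semantic similarity

Improves retrieval accuracy and relevance

Paper recommendation

Recommend related research papers based on content similarity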
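Both retrieval and recommendation come down to ranking candidate papers by their similarity to a query embedding. The sketch below shows that ranking step in plain Python. The function `rank_by_similarity`, the paper ids, and the 3-dimensional vectors are all hypothetical; in practice the vectors would come from encoding titles or abstracts with the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical paper embeddings (made-up values, not model output).
doc_vecs = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.7, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = rank_by_similarity(query_vec, doc_vecs, top_k=2)
top_ids = [doc_id for doc_id, _ in results]
```

This brute-force scan is fine for small collections. For large corpora an approximate nearest-neighbor index is the usual choice, but that is outside the scope of this sketch.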

Text analysis

Scientific text clustering

Group similar scientific paper abstracts together
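As a rough illustration of grouping abstracts by embedding similarity, the sketch below uses a simple greedy threshold grouping. This is a deliberately small stand-in, not a method described by the model authors; a standard clustering algorithm such as k-means would normally be used instead. The function name `group_by_similarity`, the threshold of 0.8, and the 2-dimensional vectors are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def group_by_similarity(vectors, threshold=0.8):
    """Greedy single-pass grouping.

    Each vector joins the first group whose seed (first member) has
    cosine similarity >= threshold; otherwise it starts a new group.
    Returns a list of groups, each a list of indices into `vectors`.
    """
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            seed = vectors[group[0]]
            if cosine_similarity(vec, seed) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough.
            groups.append([i])
    return groups

# Hypothetical abstract embeddings (made-up values, not model output).
abstract_vecs = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.0, 0.9],
]
groups = group_by_similarity(abstract_vecs, threshold=0.8)
```

The result depends on input order and on the chosen threshold, which is one reason to prefer an established clustering method for real analyses.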

🚀 DeCLUTR-sci-base

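DeCLUTR-sci-base is a sentence similarity model pretrained on scientific literature. It provides high-quality sentence embeddings for scientific text and can be used for tasks such as computing semantic similarity between scientific texts.

🚀 Quick Start

Model description

This model is based on allenai/scibert_scivocab_uncased. It received extended pretraining on over 2 million scientific papers from S2ORC, using the self-supervised training strategy presented in DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.

Intended uses and limitations

The model is intended to be used as a sentence encoder, similar to Google's Universal Sentence Encoder or Sentence Transformers, and is particularly suited to scientific text.

How to use

Please see our repo for full details. A simple example is shown below.

💻 Usage Examples

Basic usage

With SentenceTransformers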

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("johngiorgi/declutr-sci-base")

# Prepare some text to embed
texts = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]

# Embed the text
embeddings = model.encode(texts)

# Compute semantic similarity as 1 minus the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])

With 🤗 Transformers

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Prepare some text to embed
texts = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Embed the text (no gradients needed for inference)
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean-pool the token-level embeddings to get sentence-level embeddings,
# using the attention mask so that padding tokens are ignored
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute semantic similarity as 1 minus the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
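The mean-pooling step in the Transformers example can be easier to follow on a tiny input. The sketch below re-implements the same idea in plain Python for a single sentence: token vectors at padded positions (mask value 0) are excluded from the average. The helper `masked_mean_pool` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1.

    token_vectors: list of equal-length lists, one per token position.
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                totals[j] += value
    # Mirror the clamp in the example above to avoid dividing by zero
    # when every position is padding.
    denom = max(count, 1e-9)
    return [total / denom for total in totals]

# Two real tokens followed by one padding position (made-up values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
```

Here the padded vector `[100.0, 100.0]` has no effect on the result, which is the average of the first two token vectors only.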

BibTeX entry and citation info

@inproceedings{giorgi-etal-2021-declutr,
    title        = {{D}e{CLUTR}: Deep Contrastive Learning for Unsupervised Textual Representations},
    author       = {Giorgi, John  and Nitski, Osvald  and Wang, Bo  and Bader, Gary},
    year         = 2021,
    month        = aug,
    booktitle    = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    publisher    = {Association for Computational Linguistics},
    address      = {Online},
    pages        = {879--895},
    doi          = {10.18653/v1/2021.acl-long.72},
    url          = {https://aclanthology.org/2021.acl-long.72}
}