DeCLUTR-sci-base
A sentence encoder model, especially suitable for scientific text, based on extended pretraining of allenai/scibert_scivocab_uncased with a self-supervised learning strategy.
🚀 Quick Start
This model is designed to be used as a sentence encoder, similar to Google's Universal Sentence Encoder or Sentence Transformers. It is particularly well-suited for scientific text.
✨ Features
- Produces sentence-level embeddings tailored to scientific text
- Builds on allenai/scibert_scivocab_uncased with extended pretraining on over 2 million papers from S2ORC
- Trained with the self-supervised DeCLUTR contrastive strategy
📦 Installation
No specific installation steps are provided in the original README. To use this model, install the relevant libraries, such as sentence-transformers or transformers.
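For example, a typical setup might look like the following (package names are the usual PyPI ones, not taken from the original README):

```bash
pip install sentence-transformers transformers torch scipy
```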
💻 Usage Examples
Basic Usage
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Load the pretrained sentence encoder
model = SentenceTransformer("johngiorgi/declutr-sci-base")

texts = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]

# Encode the sentences and compute their cosine similarity
embeddings = model.encode(texts)
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
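Here semantic_sim is a cosine similarity in [-1, 1]; sentence pairs that are closer in meaning score closer to 1.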
With 🤗 Transformers
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]

# Tokenize, then mean-pool the token embeddings, masking out padding tokens
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    sequence_output = model(**inputs)[0]
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
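If you prefer to stay in PyTorch rather than round-tripping through SciPy, the same similarity can be computed with torch.nn.functional.cosine_similarity (a minimal sketch, assuming the embeddings tensor from the snippet above):

```python
import torch.nn.functional as F

# Cosine similarity between the two pooled sentence embeddings
semantic_sim = F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()
```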
📚 Documentation
Model description
This is the allenai/scibert_scivocab_uncased model with extended pretraining on over 2 million scientific papers from S2ORC, using the self-supervised training strategy presented in DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.
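To illustrate the general shape of that strategy: contrastive objectives of this family pull embeddings of text spans from the same document together and push apart embeddings from other documents in the batch. Below is a schematic InfoNCE-style loss (an illustrative sketch only, not the actual DeCLUTR training code; the function name and temperature value are invented for the example):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05):
    """InfoNCE-style loss: row i of `anchors` and `positives` form a positive pair;
    every other row in the batch serves as an in-batch negative."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature   # (batch, batch) similarity logits
    labels = torch.arange(a.size(0))   # the matching index is the positive
    return F.cross_entropy(logits, labels)
```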
Intended uses & limitations
The model is intended to be used as a sentence encoder, similar to Google's Universal Sentence Encoder or Sentence Transformers. It is particularly suitable for scientific text.
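For example, a typical use is ranking scientific sentences against a query (a minimal sketch; the corpus and query strings here are illustrative and not from the original model card):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("johngiorgi/declutr-sci-base")

corpus = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Transformers are effective for natural language processing tasks.",
]
query = "Which genetic mutations frequently occur in tumors?"

corpus_emb = model.encode(corpus)  # (n_sentences, dim) numpy array
query_emb = model.encode(query)    # (dim,) numpy array

# Rank corpus sentences by cosine similarity to the query
scores = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb)
)
best_match = corpus[int(np.argmax(scores))]
```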
How to use
Please see the DeCLUTR repo (https://github.com/JohnGiorgi/DeCLUTR) for full details.
📄 License
The model is released under the Apache-2.0 license.
BibTeX entry and citation info
@inproceedings{giorgi-etal-2021-declutr,
    title = {{D}e{CLUTR}: Deep Contrastive Learning for Unsupervised Textual Representations},
    author = {Giorgi, John and Nitski, Osvald and Wang, Bo and Bader, Gary},
    year = 2021,
    month = aug,
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    publisher = {Association for Computational Linguistics},
    address = {Online},
    pages = {879--895},
    doi = {10.18653/v1/2021.acl-long.72},
    url = {https://aclanthology.org/2021.acl-long.72}
}
| Property | Details |
|----------|---------|
| Model Type | Sentence encoder |
| Training Data | Over 2 million scientific papers from S2ORC |