WikiMedical_sent_biobert Open-Source Model - Optimize Medical Wiki Content and Accurately Calculate Sentence Similarity

Wikimedical Sent Biobert

Developed by nuvocare

A sentence-transformers model based on BioBERT, fine-tuned on medical Wikipedia content and used to compute sentence similarity

Text Embedding

Transformers

#Medical text similarity #Biomedical BERT #Wikipedia Medicine

Downloads 118

Release Time: 10/18/2023

Model Overview

This model can map medical-related sentences and paragraphs into a 768-dimensional vector space, primarily used for medical text clustering and semantic search tasks

Model Features

Medical domain optimization

Fine-tuned on medical Wikipedia content (the nuvocare/WikiMedical_sentence_similarity dataset) for medical text processing

High-precision similarity calculation

Achieved a cosine Spearman score of 0.87 and a cosine Pearson score of 0.95 on the WikiMedical_sentence_similarity test set

Efficient vectorization

Can quickly convert sentences and paragraphs into 768-dimensional dense vectors for subsequent processing

Model Capabilities

Sentence embedding vectorization

Semantic similarity calculation

Medical text clustering

Medical content semantic search

Use Cases

Medical information retrieval

Wikipedia medical entry correlation

Determine whether two medical texts come from the same Wikipedia page

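Expected to be accurate, judging by the reported evaluation scores (cosine Spearman 0.87, cosine Pearson 0.95); no task-specific accuracy figure is reported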

Medical knowledge management

Medical literature clustering

Automatically group similar medical content literature

🚀 WikiMedical_sent_biobert

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, useful for tasks like clustering and semantic search.

This model belongs to the sentence-similarity pipeline and is tagged with sentence-transformers, feature-extraction, sentence-similarity, and transformers. It is trained on the nuvocare/WikiMedical_sentence_similarity dataset.

WikiMedical_sent_biobert is based on the dmis-lab/biobert-base-cased-v1.2 backbone and has been trained on the WikiMedical_sentence_similarity dataset. It can predict whether two medical texts are related to the same Wikipedia page.

🚀 Quick Start

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Can be used for clustering or semantic search.
Can predict whether two medical texts are related to the same Wikipedia page.
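As a rough illustration of the semantic search use, the sketch below ranks a small corpus by cosine similarity to a query. It uses toy 3-dimensional vectors as stand-ins for the 768-dimensional embeddings the model would produce; the helper names `cosine_similarity` and `semantic_search` are hypothetical and not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors (assumption): real embeddings would come from the model.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [0.8, 0.2, 0.0]

hits = semantic_search(query, corpus, top_k=2)
for index, score in hits:
    print(index, round(score, 4))
```

With real embeddings, the same ranking logic applies unchanged; only the vectors differ.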

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have sentence-transformers installed:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('nuvocare/WikiMedical_sent_biobert')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

Without sentence-transformers, you can use the model like this:

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the Hugging Face Hub (the repository ID includes the nuvocare namespace)
tokenizer = AutoTokenizer.from_pretrained('nuvocare/WikiMedical_sent_biobert')
model = AutoModel.from_pretrained('nuvocare/WikiMedical_sent_biobert')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

📚 Documentation

Evaluation Results

The model is evaluated on the test set of WikiMedical_sentence_similarity. It achieves:

A cosine Spearman score of 0.87
A cosine Pearson score of 0.95
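To make the two metrics concrete: Pearson measures how linearly the predicted cosine similarities track the gold scores, while Spearman compares only their rank order. The sketch below computes both in plain Python on made-up numbers; the `gold` and `predicted` lists are illustrative assumptions, not values from the WikiMedical test set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up scores (assumption): same ordering, but not a linear relationship.
gold = [0.1, 0.4, 0.5, 0.8, 1.0]
predicted = [0.05, 0.10, 0.15, 0.50, 0.99]

print("Pearson :", round(pearson(gold, predicted), 4))
print("Spearman:", round(spearman(gold, predicted), 4))
```

Because the toy predictions preserve the gold ordering without being linear in it, Spearman comes out higher than Pearson here. The model's reported scores show the opposite pattern (0.95 Pearson against 0.87 Spearman), which is also possible since the two metrics capture different things.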

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 3170 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss
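For intuition about this loss: in its default configuration, CosineSimilarityLoss takes the cosine similarity of the two sentence embeddings in each pair and penalizes the squared difference from the pair's label. The sketch below reproduces that idea in plain Python on toy 2-dimensional vectors; it is a simplified illustration, not the library's implementation, and the function names and data are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label of each pair."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy embedding pairs (assumption): one identical pair, one orthogonal pair.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]

# Labels that agree with the geometry give zero loss ...
print(cosine_similarity_loss(pairs, [1.0, 0.0]))
# ... and labels that contradict it give a large loss.
print(cosine_similarity_loss(pairs, [0.0, 1.0]))
```

Training with this objective pushes embeddings of related pairs toward high cosine similarity, which is why cosine similarity is the natural way to compare the resulting vectors.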

Parameters of the fit()-Method:

{
    "epochs": 2,
    "evaluation_steps": 2000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 300,
    "weight_decay": 0.01
}
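The "WarmupLinear" scheduler can be pictured as a learning-rate multiplier that rises linearly from 0 to 1 over the warmup steps and then decays linearly to 0 by the end of training. The sketch below applies that shape to the parameters listed above. It assumes the total number of steps is the DataLoader length times the number of epochs (3170 x 2 = 6340), since steps_per_epoch is null; the function name is hypothetical.

```python
def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 2e-5          # "lr" from optimizer_params
WARMUP_STEPS = 300      # "warmup_steps"
TOTAL_STEPS = 3170 * 2  # assumed: DataLoader length x epochs

for step in (0, 150, 300, 3320, 6340):
    lr = BASE_LR * warmup_linear_factor(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions the learning rate peaks at 2e-05 at step 300 and is back to half of that roughly midway through the remaining steps.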

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
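The Pooling layer above has only pooling_mode_mean_tokens enabled, so a sentence embedding is the average of its token embeddings, with padding positions excluded through the attention mask. The following worked example shows that computation on plain Python lists; the 2-dimensional token vectors are toy stand-ins for the real 768-dimensional ones, and the function name is an assumption.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                totals[d] += vector[d]
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [total / count for total in totals]

# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = masked_mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The padding vector [9.0, 9.0] does not affect the result, which is the point of weighting by the attention mask.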

📄 License

No license information provided.

Citing & Authors

Samuel Chaineau
