Korean Medical DPR (Dense Passage Retrieval)
This is a Bi-Encoder retrieval model for the medical domain. Because Korean medical records are written in a mix of Korean and English, SapBERT-KO-EN is used as the base model. Questions are encoded with the Question Encoder and passages with the Context Encoder.
(※ This model is trained on the Ultra-large AI Healthcare Q&A Data from AI Hub.)
✨ Features
Self-Alignment Pretraining (SAP)
Korean medical records are written in a mix of Korean and English, so the model must recognize English medical terms as well. SapBERT-KO-EN is trained with a Multi Similarity Loss so that terms sharing the same concept code have high embedding similarity. For example, the following terms all share one code (a sketch of the training objective follows the list):

C3843080 || Hypertension disease
C3843080 || Hypertension
C3843080 || High Blood Pressure
C3843080 || HTN
C3843080 || HBP
Dense Passage Retrieval (DPR)
To turn SapBERT-KO-EN into a retrieval model, additional fine-tuning is required. The model is fine-tuned with DPR, which trains a Bi-Encoder to score the similarity between queries and passages. The training data is augmented with Korean-English mixed samples such as the following, so that English terms appearing in queries are handled as well (see the sketch after this example):
Disease name (Korean): 고혈압
Disease name (English): Hypertension
Query (original): My father was told he has 고혈압, but I don't know what it is. Please explain what 고혈압 is.
Query (augmented): My father was told he has Hypertension, but I don't know what it is. Please explain what Hypertension is.
🔧 Technical Details
Self-Alignment Pretraining (SAP)
The base model and hyperparameters used to train SapBERT-KO-EN are as follows. The training data is KOSTOM, a Korean standard medical terminology set that lists Korean and English terms for each concept code.
| Property | Details |
|----------|---------|
| Model | klue/bert-base |
| Dataset | KOSTOM |
| Epochs | 1 |
| Batch Size | 64 |
| Max Length | 64 |
| Dropout | 0.1 |
| Pooler | 'cls' |
| Eval Step | 100 |
| Threshold | 0.8 |
| Scale Positive Sample | 1 |
| Scale Negative Sample | 60 |
Dense Passage Retrieval (DPR)
The base model and hyperparameters used for fine-tuning are as follows.
| Property | Details |
|----------|---------|
| Model | SapBERT-KO-EN (klue/bert-base) |
| Dataset | Ultra-large AI Healthcare Q&A Data (AI Hub) |
| Epochs | 10 |
| Batch Size | 64 |
| Dropout | 0.1 |
| Pooler | 'cls' |
💻 Usage Examples
Basic Usage
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question encoder and tokenizer.
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context encoder and tokenizer.
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)

# A Korean-English mixed query and three Korean clinical notes.
query = 'high blood pressure 처방 사례'

targets = [
    """고혈압 진단.
환자 상담 및 생활습관 교정 권고. 저염식, 규칙적인 운동, 금연, 금주 지시.
환자 재방문. 혈압: 150/95mmHg. 약물치료 시작. Amlodipine 5mg 1일 1회 처방.""",

    """응급실 도착 후 위내시경 진행.
소견: Gastric ulcer에서 Forrest IIb 관찰됨. 출혈은 소량의 삼출성 출혈 형태.
처치: 에피네프린 주사로 출혈 감소 확인. Hemoclip 2개로 출혈 부위 클리핑하여 지혈 완료.""",

    """혈중 높은 지방 수치 및 지방간 소견.
다발성 gallstones 확인. 증상 없을 경우 경과 관찰 권장.
우측 renal cyst, 양성 가능성 높으며 추가적인 처치 불필요함."""
]

# Encode the query with the question encoder ('cls' pooling).
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Encode each context with the context encoder and score it against the query.
for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```

```
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
```
License
This project is licensed under the MIT license.
Documentation
Citing
```bibtex
@inproceedings{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={4228--4238},
  month=jun,
  year={2021}
}

@inproceedings{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}
```