Korean Medical DPR (Dense Passage Retrieval)
This is a Bi-Encoder retrieval model for the medical domain. Because Korean medical records are written in a mix of Korean and English, SapBERT-KO-EN is used as the base model. Questions are encoded with the Question Encoder and passages with the Context Encoder.
(※ This model is trained on the Ultra-large AI Healthcare Q&A Data from AI Hub.)
✨ Features
Self-Alignment Pretraining (SAP)
Korean medical records are written in a mix of Korean and English, so the model must recognize English medical terms as well. SapBERT-KO-EN is trained with a Multi Similarity Loss so that terms sharing the same concept code have high embedding similarity. For example, the following terms all share one code (a sketch of the training objective follows the list):

C3843080 || Hypertension disease
C3843080 || Hypertension
C3843080 || High Blood Pressure
C3843080 || HTN
C3843080 || HBP
Dense Passage Retrieval (DPR)
To turn SapBERT-KO-EN into a retrieval model, additional fine-tuning is required. The model is fine-tuned with DPR, which trains a Bi-Encoder to score the similarity between queries and passages. The training data is augmented with Korean-English mixed samples such as the following, so that English terms appearing in queries are handled as well (see the sketch after this example):
Disease name (Korean): 고혈압
Disease name (English): Hypertension
Query (original): My father was told he has 고혈압, but I don't know what it is. Please explain what 고혈압 is.
Query (augmented): My father was told he has Hypertension, but I don't know what it is. Please explain what Hypertension is.
🔧 Technical Details
Self-Alignment Pretraining (SAP)
The base model and hyperparameters used to train SapBERT-KO-EN are as follows. The training data is KOSTOM, a Korean standard medical terminology set that lists Korean and English terms for each concept code.
| Property | Details |
|----------|---------|
| Model | klue/bert-base |
| Dataset | KOSTOM |
| Epochs | 1 |
| Batch Size | 64 |
| Max Length | 64 |
| Dropout | 0.1 |
| Pooler | 'cls' |
| Eval Step | 100 |
| Threshold | 0.8 |
| Scale Positive Sample | 1 |
| Scale Negative Sample | 60 |
Dense Passage Retrieval (DPR)
The base model and hyperparameters used for fine-tuning are as follows.
| Property | Details |
|----------|---------|
| Model | SapBERT-KO-EN (klue/bert-base) |
| Dataset | Ultra-large AI Healthcare Q&A Data (AI Hub) |
| Epochs | 10 |
| Batch Size | 64 |
| Dropout | 0.1 |
| Pooler | 'cls' |
💻 Usage Examples
Basic Usage
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question encoder and tokenizer.
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context encoder and tokenizer.
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)

# A Korean-English mixed query and three Korean clinical notes.
query = 'high blood pressure 처방 사례'

targets = [
    """고혈압 진단.
환자 상담 및 생활습관 교정 권고. 저염식, 규칙적인 운동, 금연, 금주 지시.
환자 재방문. 혈압: 150/95mmHg. 약물치료 시작. Amlodipine 5mg 1일 1회 처방.""",

    """응급실 도착 후 위내시경 진행.
소견: Gastric ulcer에서 Forrest IIb 관찰됨. 출혈은 소량의 삼출성 출혈 형태.
처치: 에피네프린 주사로 출혈 감소 확인. Hemoclip 2개로 출혈 부위 클리핑하여 지혈 완료.""",

    """혈중 높은 지방 수치 및 지방간 소견.
다발성 gallstones 확인. 증상 없을 경우 경과 관찰 권장.
우측 renal cyst, 양성 가능성 높으며 추가적인 처치 불필요함."""
]

# Encode the query with the question encoder ('cls' pooling).
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Encode each context with the context encoder and score it against the query.
for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```

```
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
```
License
This project is licensed under the MIT license.
Documentation
Citing
```bibtex
@inproceedings{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={4228--4238},
  month=jun,
  year={2021}
}

@inproceedings{karpukhin2020dense,
  title={Dense Passage Retrieval for Open-Domain Question Answering},
  author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}
```