Fine-tuned RoBERTa-based Medical Inquiry Intention Recognition Model
This project is a medical inquiry intention recognition model fine-tuned from RoBERTa. It identifies whether a user's query is a medical consultation or casual chat.
Quick Start
Project Introduction
- Project Source: This project is part of the dialogue-based medical consultation system developed by Zhongke (Anhui) G60 Smart Health Innovation Research Institute (hereinafter "Zhongke") for the research and development of a large mental health model. This repository focuses on the intention recognition task.
- Model Purpose: The model recognizes the intention of the `query` text entered by the user in the dialogue system, distinguishing between [Medical Inquiry] and [Casual Chat].
Features
Data Description
- Data Source: The dataset is constructed by merging and preprocessing an open-source dialogue dataset from Hugging Face and Zhongke's in-house medical dialogue dataset.
- Data Partition: There are 6,000 samples in total: 4,800 in the training set and 1,200 in the test set. Positive and negative samples were balanced during data construction.
- Data Examples:

```json
[
  {
    "query": "What are the names of the 5 most popular movies recently?",
    "label": "nonmed"
  },
  {
    "query": "What could be the cause of joint pain and foot pain?",
    "label": "med"
  },
  {
    "query": "I've been having cold sweats, stomachaches, nausea, and vomiting recently, which seriously affects my study and work.",
    "label": "med"
  }
]
```
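For reference, samples in this format could be loaded with the `datasets` library roughly as follows. This is a minimal sketch: the file name `data.json`, the label-to-id mapping, and the split seed are illustrative assumptions, not part of the released dataset.

```python
# A minimal loading sketch, assuming the samples above are stored in
# `data.json` (the file name, label mapping, and seed are assumptions).
from datasets import load_dataset

LABEL2ID = {"nonmed": 0, "med": 1}  # assumed: 0 = Casual Chat, 1 = Medical Inquiry

dataset = load_dataset("json", data_files="data.json", split="train")

# Convert the string labels into the integer ids a classifier expects.
dataset = dataset.map(lambda x: {"label": LABEL2ID[x["label"]]})

# Reproduce the 4,800 / 1,200 train/test partition described above.
dataset = dataset.train_test_split(test_size=0.2, seed=42)
print(dataset)
```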
Experimental Environment
Featurize Online Platform Instance:
- CPU: 6-core E5-2680 V4
- GPU: RTX 3060, 12.6 GB VRAM
- Pre-installed Image: Ubuntu 20.04, Python 3.9/3.10, PyTorch 2.0.1, TensorFlow 2.13.0, Docker 20.10.10; CUDA is kept as up to date as possible.
- Libraries to be Manually Installed:

```bash
pip install transformers datasets evaluate accelerate
```
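Before training, a quick sanity check along these lines can confirm that the library versions and the GPU are visible (illustrative only):

```python
# Illustrative environment check: library versions and GPU visibility.
import torch
import transformers

print(transformers.__version__)
print(torch.__version__)              # 2.0.1 on the instance described above
print(torch.cuda.is_available())      # True when the RTX 3060 is visible
print(torch.cuda.get_device_name(0))  # e.g. 'NVIDIA GeForce RTX 3060'
```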
Training Method
- The model is fine-tuned from the chinese-roberta-wwm-ext Chinese pre-trained model released by the Harbin Institute of Technology and iFlytek Joint Laboratory (HFL), using the `transformers` library from Hugging Face.
Training Parameters, Results, and Limitations
```json
{
  "output_dir": "output",
  "num_train_epochs": 2,
  "learning_rate": 3e-5,
  "lr_scheduler_type": "cosine",
  "per_device_train_batch_size": 16,
  "per_device_eval_batch_size": 16,
  "weight_decay": 0.01,
  "warmup_ratio": 0.02,
  "logging_steps": 0.01,
  "logging_strategy": "steps",
  "fp16": true,
  "eval_strategy": "steps",
  "eval_steps": 0.1,
  "save_strategy": "epoch"
}
```
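For reference, the parameters above might be wired into the Hugging Face `Trainer` along the following lines. This is a minimal sketch, not the project's released training script; it assumes `dataset` is the train/test `DatasetDict` from the loading sketch above, and the `tokenize` helper is an illustrative assumption.

```python
# A minimal training sketch wiring the parameters above into Trainer.
# Assumes `dataset` is the train/test DatasetDict from the loading sketch
# above; this is not the project's released training script.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "hfl/chinese-roberta-wwm-ext"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

def tokenize(batch):
    # Truncate long queries; dynamic padding is applied per batch by the
    # default data collator.
    return tokenizer(batch["query"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="output",
    num_train_epochs=2,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.02,
    logging_steps=0.01,      # floats in (0, 1) are treated as a ratio of total steps
    logging_strategy="steps",
    fp16=True,
    eval_strategy="steps",   # named `evaluation_strategy` in older transformers releases
    eval_steps=0.1,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()
```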
- Fine-tuning Results (see the metric sketch after this list)
| Dataset | Accuracy | F1 Score |
| ------ | ------ | ------ |
| Test Set | 0.99 | 0.98 |
- Limitations
Overall, the fine-tuned model performs well on medical inquiry intention recognition. However, because the training data is limited in size and sample diversity, performance may degrade on inputs that differ from the training distribution.
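The Accuracy and F1 scores in the table above could be computed with the `evaluate` library along the following lines. This is a sketch, not the project's evaluation script; `preds` and `labels` are placeholder arrays, and the availability of the `confusion_matrix` metric id on the Hub is an assumption.

```python
# A sketch of the metric computation with the `evaluate` library; `preds`
# and `labels` are placeholder lists of predicted / gold label ids.
import evaluate

preds = [1, 0, 1, 1]   # hypothetical model predictions (1 = med, 0 = nonmed)
labels = [1, 0, 1, 0]  # hypothetical gold labels

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
confusion_matrix = evaluate.load("confusion_matrix")

print(accuracy.compute(predictions=preds, references=labels))  # {'accuracy': 0.75}
print(f1.compute(predictions=preds, references=labels))        # {'f1': 0.8}
print(confusion_matrix.compute(predictions=preds, references=labels))
```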
Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ID2LABEL = {0: "Casual Chat", 1: "Medical Inquiry"}
MODEL_NAME = 'HZhun/RoBERTa-Chinese-Med-Inquiry-Intention-Recognition-base'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    torch_dtype='auto'
)
model.eval()  # inference mode

query = 'The child is currently 28 years old. When in a bad mood, he often vomits blood without warning. Multiple examinations have been conducted on the respiratory and digestive systems, but no results have been found. He has been vomiting blood continuously in the morning for the past three days.'

# Tokenize and move the tensors to the model's device.
tokenized_query = tokenizer(query, return_tensors='pt')
tokenized_query = {k: v.to(model.device) for k, v in tokenized_query.items()}

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**tokenized_query)

# Map the highest-scoring class id to its label string.
pred_id = outputs.logits.argmax(-1).item()
intent = ID2LABEL[pred_id]
print(intent)
```
Terminal Output

```text
Medical Inquiry
```
Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ID2LABEL = {0: "Casual Chat", 1: "Medical Inquiry"}
MODEL_NAME = 'HZhun/RoBERTa-Chinese-Med-Inquiry-Intention-Recognition-base'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side='left')
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    torch_dtype='auto'
)
model.eval()  # inference mode

query = [
    'I have a stomachache and have been having diarrhea for several days. Sometimes I vomit in the middle of the night.',
    'How can I remove the hair on my legs without using any pharmaceutical or medical devices?',
    'Hello, what medicine should I take for a cold and cough?',
    'What do you think of the weather today? I think we can go camping!'
]

# Batch-tokenize: pad to the longest query in the batch and truncate
# anything beyond the model's maximum length.
tokenized_query = tokenizer(query, return_tensors='pt', padding=True, truncation=True)
tokenized_query = {k: v.to(model.device) for k, v in tokenized_query.items()}

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**tokenized_query)

# One prediction per query in the batch.
pred_ids = outputs.logits.argmax(-1).tolist()
intent = [ID2LABEL[pred_id] for pred_id in pred_ids]
print(intent)
```
Terminal Output

```text
['Medical Inquiry', 'Casual Chat', 'Medical Inquiry', 'Casual Chat']
```
Documentation
| Property | Details |
| ------ | ------ |
| Base Model | hfl/chinese-roberta-wwm-ext |
| Pipeline Tag | text-classification |
| Tags | medical |
| Metrics | confusion_matrix, accuracy, f1 |
License
This project is licensed under the Apache-2.0 license.