🚀 Medical Inquiry Intention Recognition Model Fine-tuned on RoBERTa
This project is a text classification model for medical inquiry intention recognition. It can identify whether the user's query in the dialogue system is for medical consultation or casual chat.
📦 Metadata
| Property | Details |
| ------ | ------ |
| License | Apache-2.0 |
| Base Model | hfl/chinese-roberta-wwm-ext |
| Pipeline Tag | text-classification |
| Tags | medical |
| Metrics | confusion_matrix, accuracy, f1 |
🚀 Quick Start
Project Introduction
- Project Source: This project is part of the dialogue-guided diagnosis system that Zhongke (Anhui) G60 Smart Health Innovation Research Institute (hereinafter "Zhongke") is building in the course of researching and developing a large model for mental health. It covers the intent recognition task.
- Model Purpose: The model identifies the intent of the query text entered by the user in the dialogue system, determining whether it is a "medical consultation" or "casual chat".
Data Description
- Data Source: The dataset is constructed by integrating and preprocessing the open-source dialogue dataset from Hugging Face and the in-house medical dialogue dataset of Zhongke.
- Data Partition: The dataset contains 6000 samples in total: 4800 in the training set and 1200 in the test set. Positive and negative samples were balanced during data construction.
- Data Example:
[
    {
        "query": "What are the names of the 5 most popular movies recently?",
        "label": "nonmed"
    },
    {
        "query": "What could be the cause of joint pain and foot pain?",
        "label": "med"
    },
    {
        "query": "I've been having cold sweats, stomachaches, nausea, and vomiting recently, which seriously affects my study and work.",
        "label": "med"
    }
]
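As a rough illustration of how records in this format could be prepared for training, the sketch below maps the string labels to integer ids and performs a stratified 80/20 split, which is the ratio implied by the 4800/1200 partition. The `LABEL2ID` mapping and the `stratified_split` helper are assumptions made for illustration (the mapping is chosen to agree with the `ID2LABEL` dictionary in the usage examples), and the toy records stand in for the real dataset, whose preprocessing code is not published here.

```python
import random
from collections import Counter, defaultdict

# Assumed mapping, chosen to agree with ID2LABEL in the usage examples
# (0 = casual chat, 1 = medical consultation).
LABEL2ID = {"nonmed": 0, "med": 1}


def stratified_split(records, test_ratio=0.2, seed=42):
    """Split records into train/test while preserving the label ratio."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_ratio)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy stand-in for the real dataset: 10 balanced placeholder records.
records = [
    {"query": f"placeholder query {i}", "label": "med" if i % 2 else "nonmed"}
    for i in range(10)
]
train, test = stratified_split(records)

# Encode string labels as integer class ids for the classifier.
train_ids = [LABEL2ID[r["label"]] for r in train]
print(len(train), len(test), Counter(train_ids))
```

Splitting per label rather than over the whole list keeps both classes equally represented in the training and test sets, which matters when accuracy is the headline metric.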
Experimental Environment
Featurize Online Platform Instance:
- CPU: 6-core E5-2680 V4
- GPU: RTX3060, 12.6GB VRAM
- Pre-installed Image: Ubuntu 20.04, Python 3.9/3.10, PyTorch 2.0.1, TensorFlow 2.13.0, Docker 20.10.10, CUDA (as recent a version as possible)
- Libraries to be Manually Installed:
pip install transformers datasets evaluate accelerate
Training Method
- Fine-tune chinese-roberta-wwm-ext, the Chinese pre-trained model released by the Joint Laboratory of Harbin Institute of Technology and iFLYTEK (HFL), using the Hugging Face transformers library.
Training Parameters, Results, and Limitations
{
    "output_dir": "output",
    "num_train_epochs": 2,
    "learning_rate": 3e-5,
    "lr_scheduler_type": "cosine",
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.01,
    "warmup_ratio": 0.02,
    "logging_steps": 0.01,
    "logging_strategy": "steps",
    "fp16": true,
    "eval_strategy": "steps",
    "eval_steps": 0.1,
    "save_strategy": "epoch"
}
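Several of the values above are fractions rather than step counts: the transformers `Trainer` interprets `logging_steps` and `eval_steps` below 1 as a ratio of the total number of training steps, and `warmup_ratio` works the same way. The sketch below converts them into concrete step counts for this dataset. It assumes a single GPU and no gradient accumulation, and `ratio_to_steps` is a hypothetical helper written for illustration, not a transformers function.

```python
import math

NUM_TRAIN_SAMPLES = 4800   # training-set size stated in the data description
BATCH_SIZE = 16            # per_device_train_batch_size
NUM_EPOCHS = 2             # num_train_epochs

# Assumption: one GPU and no gradient accumulation.
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS


def ratio_to_steps(ratio, total):
    """Convert a fractional setting into a whole number of optimizer steps."""
    return max(1, math.ceil(total * ratio))


warmup_steps = ratio_to_steps(0.02, total_steps)   # warmup_ratio
logging_every = ratio_to_steps(0.01, total_steps)  # logging_steps
eval_every = ratio_to_steps(0.1, total_steps)      # eval_steps

print(total_steps, warmup_steps, logging_every, eval_every)
```

Under these assumptions the run has 600 optimizer steps, with 12 warmup steps, a log entry every 6 steps, and an evaluation every 60 steps.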
- Fine-tuning Results:
| Dataset | Accuracy | F1 Score |
| ------ | ------ | ------ |
| Test Set | 0.99 | 0.98 |
- Limitations:
⚠️ Important Note
Overall, the fine-tuned model performs well at recognizing the intent of medical inquiries. However, because the training data is limited in both size and diversity, predictions may be less reliable on queries that differ from the training samples.
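For reference, the accuracy and F1 figures in the results table, along with the confusion matrix listed in the metadata, can be computed from test-set predictions along the following lines. This is a dependency-free sketch rather than the evaluation code actually used. It assumes binary F1 with "medical consultation" (id 1) as the positive class, and the labels are toy values, not real results.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and binary F1 from the confusion-matrix counts."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels for illustration only (1 = medical consultation, 0 = casual chat).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print(accuracy_and_f1(y_true, y_pred))
```

On a balanced test set such as this one, accuracy and F1 tend to track each other closely. The confusion matrix is still worth inspecting, because it shows whether errors are mostly missed medical queries or casual chat misrouted as medical.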
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Class ids produced by the model, mapped to readable intent names
ID2LABEL = {0: "Casual Chat", 1: "Medical Consultation"}
MODEL_NAME = 'HZhun/RoBERTa-Chinese-Med-Inquiry-Intention-Recognition-base'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    torch_dtype='auto'
)

query = 'The child is currently 28 years old. When in a bad mood, he often spits up blood without warning. Multiple examinations of the respiratory and digestive systems have not yielded any results. He has been spitting up blood every morning for the past three days.'

# Tokenize the query and move the tensors to the model's device
tokenized_query = tokenizer(query, return_tensors='pt')
tokenized_query = {k: v.to(model.device) for k, v in tokenized_query.items()}

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**tokenized_query)

# The index of the largest logit is the predicted class id
pred_id = outputs.logits.argmax(-1).item()
intent = ID2LABEL[pred_id]
print(intent)
Terminal Output:
Medical Consultation
Advanced Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Class ids produced by the model, mapped to readable intent names
ID2LABEL = {0: "Casual Chat", 1: "Medical Consultation"}
MODEL_NAME = 'HZhun/RoBERTa-Chinese-Med-Inquiry-Intention-Recognition-base'

# The default right padding is kept so that [CLS] stays at index 0,
# the position the BERT-style classification head reads
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    torch_dtype='auto'
)

# A batch of queries classified in a single forward pass
query = [
    'I have a stomachache and have been having diarrhea for several days. Sometimes I vomit in the middle of the night.',
    'How can I remove the hair on my legs without using any pharmaceutical or medical devices?',
    'Hello, what medicine should I take for a cold and cough?',
    'What do you think of the weather today? I think we can go camping!'
]

# Pad to the longest query in the batch and truncate over-long inputs
tokenized_query = tokenizer(query, return_tensors='pt', padding=True, truncation=True)
tokenized_query = {k: v.to(model.device) for k, v in tokenized_query.items()}

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**tokenized_query)

# One predicted class id per query
pred_ids = outputs.logits.argmax(-1).tolist()
intent = [ID2LABEL[pred_id] for pred_id in pred_ids]
print(intent)
Terminal Output:
['Medical Consultation', 'Casual Chat', 'Medical Consultation', 'Casual Chat']
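The examples above keep only the argmax of the logits. Where a dialogue system also needs a confidence score, for instance to fall back to a clarifying question on borderline queries, the logits can be passed through a softmax first. The sketch below is one possible way to do this. The `route` helper and the 0.8 threshold are hypothetical choices for illustration, and the logits are made-up numbers; in practice each row of `outputs.logits.tolist()` would be passed in.

```python
import math

ID2LABEL = {0: "Casual Chat", 1: "Medical Consultation"}


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, threshold=0.8):
    """Return (label, confidence); label is None below the threshold."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[pred_id]
    label = ID2LABEL[pred_id] if confidence >= threshold else None
    return label, confidence


# Made-up logits for illustration only.
print(route([-2.0, 3.0]))   # clear-cut case
print(route([0.1, 0.3]))    # borderline case, falls below the threshold
```

Any threshold should be tuned on held-out data, since softmax scores from a fine-tuned classifier are not guaranteed to be well calibrated.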
📄 License
This project is licensed under the Apache-2.0 license.