Fine-tuned RoBERTa-based Medical Inquiry Intention Recognition Model
This project is a medical inquiry intention recognition model fine-tuned from RoBERTa. It identifies whether a user's query is a medical consultation or casual chat.
Quick Start
Project Introduction
- Project Source: This project is part of the dialogue-based medical consultation system developed by Zhongke (Anhui) G60 Smart Health Innovation Research Institute (hereinafter "Zhongke") for the research and development of a large mental health model. This repository focuses on the intention recognition task.
- Model Purpose: The model recognizes the intention of the `query` text entered by the user in the dialogue system, distinguishing between [Medical Inquiry] and [Casual Chat].
Features
Data Description
- Data Source: The dataset is constructed by merging and preprocessing an open-source dialogue dataset from Hugging Face and Zhongke's in-house medical dialogue dataset.
- Data Partition: There are 6,000 samples in total: 4,800 in the training set and 1,200 in the test set. Positive and negative samples were balanced during data construction.
- Data Examples:

```json
[
  {
    "query": "What are the names of the 5 most popular movies recently?",
    "label": "nonmed"
  },
  {
    "query": "What could be the cause of joint pain and foot pain?",
    "label": "med"
  },
  {
    "query": "I've been having cold sweats, stomachaches, nausea, and vomiting recently, which seriously affects my study and work.",
    "label": "med"
  }
]
```
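For reference, samples in this format could be loaded with the `datasets` library roughly as follows. This is a minimal sketch: the file name `data.json`, the label-to-id mapping, and the split seed are illustrative assumptions, not part of the released dataset.

```python
# A minimal loading sketch, assuming the samples above are stored in
# `data.json` (the file name, label mapping, and seed are assumptions).
from datasets import load_dataset

LABEL2ID = {"nonmed": 0, "med": 1}  # assumed: 0 = Casual Chat, 1 = Medical Inquiry

dataset = load_dataset("json", data_files="data.json", split="train")

# Convert the string labels into the integer ids a classifier expects.
dataset = dataset.map(lambda x: {"label": LABEL2ID[x["label"]]})

# Reproduce the 4,800 / 1,200 train/test partition described above.
dataset = dataset.train_test_split(test_size=0.2, seed=42)
print(dataset)
```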
Experimental Environment
Featurize Online Platform Instance:
- CPU: 6-core E5-2680 V4
- GPU: RTX 3060, 12.6 GB VRAM
- Pre-installed Image: Ubuntu 20.04, Python 3.9/3.10, PyTorch 2.0.1, TensorFlow 2.13.0, Docker 20.10.10; CUDA is kept as up to date as possible.
- Libraries to be Manually Installed:

```bash
pip install transformers datasets evaluate accelerate
```
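Before training, a quick sanity check along these lines can confirm that the library versions and the GPU are visible (illustrative only):

```python
# Illustrative environment check: library versions and GPU visibility.
import torch
import transformers

print(transformers.__version__)
print(torch.__version__)              # 2.0.1 on the instance described above
print(torch.cuda.is_available())      # True when the RTX 3060 is visible
print(torch.cuda.get_device_name(0))  # e.g. 'NVIDIA GeForce RTX 3060'
```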
Training Method
- The model is fine-tuned from the chinese-roberta-wwm-ext Chinese pre-trained model released by the Harbin Institute of Technology and iFlytek Joint Laboratory (HFL), using the `transformers` library from Hugging Face.
Training Parameters, Results, and Limitations
```json
{
  "output_dir": "output",
  "num_train_epochs": 2,
  "learning_rate": 3e-5,
  "lr_scheduler_type": "cosine",
  "per_device_train_batch_size": 16,
  "per_device_eval_batch_size": 16,
  "weight_decay": 0.01,
  "warmup_ratio": 0.02,
  "logging_steps": 0.01,
  "logging_strategy": "steps",
  "fp16": true,
  "eval_strategy": "steps",
  "eval_steps": 0.1,
  "save_strategy": "epoch"
}
```
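For reference, the parameters above might be wired into the Hugging Face `Trainer` along the following lines. This is a minimal sketch, not the project's released training script; it assumes `dataset` is the train/test `DatasetDict` from the loading sketch above, and the `tokenize` helper is an illustrative assumption.

```python
# A minimal training sketch wiring the parameters above into Trainer.
# Assumes `dataset` is the train/test DatasetDict from the loading sketch
# above; this is not the project's released training script.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "hfl/chinese-roberta-wwm-ext"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

def tokenize(batch):
    # Truncate long queries; dynamic padding is applied per batch by the
    # default data collator.
    return tokenizer(batch["query"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="output",
    num_train_epochs=2,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.02,
    logging_steps=0.01,      # floats in (0, 1) are treated as a ratio of total steps
    logging_strategy="steps",
    fp16=True,
    eval_strategy="steps",   # named `evaluation_strategy` in older transformers releases
    eval_steps=0.1,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()
```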
- Fine-tuning Results (see the metric sketch after this list)
| Dataset | Accuracy | F1 Score |
| ------ | ------ | ------ |
| Test Set | 0.99 | 0.98 |
- Limitations
Overall, the fine-tuned model performs well on medical inquiry intention recognition. However, because the training data is limited in size and sample diversity, performance may degrade on inputs that differ from the training distribution.
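The Accuracy and F1 scores in the table above could be computed with the `evaluate` library along the following lines. This is a sketch, not the project's evaluation script; `preds` and `labels` are placeholder arrays, and the availability of the `confusion_matrix` metric id on the Hub is an assumption.

```python
# A sketch of the metric computation with the `evaluate` library; `preds`
# and `labels` are placeholder lists of predicted / gold label ids.
import evaluate

preds = [1, 0, 1, 1]   # hypothetical model predictions (1 = med, 0 = nonmed)
labels = [1, 0, 1, 0]  # hypothetical gold labels

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
confusion_matrix = evaluate.load("confusion_matrix")

print(accuracy.compute(predictions=preds, references=labels))  # {'accuracy': 0.75}
print(f1.compute(predictions=preds, references=labels))        # {'f1': 0.8}
print(confusion_matrix.compute(predictions=preds, references=labels))
```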
Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ID2LABEL = {0: "Casual Chat", 1: "Medical Inquiry"}
MODEL_NAME = 'HZhun/RoBERTa-Chinese-Med-Inquiry-Intention-Recognition-base'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    torch_dtype='auto'
)
model.eval()  # inference mode

query = 'The child is currently 28 years old. When in a bad mood, he often vomits blood without warning. Multiple examinations have been conducted on the respiratory and digestive systems, but no results have been found. He has been vomiting blood continuously in the morning for the past three days.'

# Tokenize and move the tensors to the model's device.
tokenized_query = tokenizer(query, return_tensors='pt')
tokenized_query = {k: v.to(model.device) for k, v in tokenized_query.items()}

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**tokenized_query)

# Map the highest-scoring class id to its label string.
pred_id = outputs.logits.argmax(-1).item()
intent = ID2LABEL[pred_id]
print(intent)
```
Terminal Output

```text
Medical Inquiry
```
Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ID2LABEL = {0: "Casual Chat", 1: "Medical Inquiry"}
MODEL_NAME = 'HZhun/RoBERTa-Chinese-Med-Inquiry-Intention-Recognition-base'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side='left')
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    torch_dtype='auto'
)
model.eval()  # inference mode

query = [
    'I have a stomachache and have been having diarrhea for several days. Sometimes I vomit in the middle of the night.',
    'How can I remove the hair on my legs without using any pharmaceutical or medical devices?',
    'Hello, what medicine should I take for a cold and cough?',
    'What do you think of the weather today? I think we can go camping!'
]

# Batch-tokenize: pad to the longest query in the batch and truncate
# anything beyond the model's maximum length.
tokenized_query = tokenizer(query, return_tensors='pt', padding=True, truncation=True)
tokenized_query = {k: v.to(model.device) for k, v in tokenized_query.items()}

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**tokenized_query)

# One prediction per query in the batch.
pred_ids = outputs.logits.argmax(-1).tolist()
intent = [ID2LABEL[pred_id] for pred_id in pred_ids]
print(intent)
```
Terminal Output

```text
['Medical Inquiry', 'Casual Chat', 'Medical Inquiry', 'Casual Chat']
```
Documentation
| Property | Details |
| ------ | ------ |
| Base Model | hfl/chinese-roberta-wwm-ext |
| Pipeline Tag | text-classification |
| Tags | medical |
| Metrics | confusion_matrix, accuracy, f1 |
License
This project is licensed under the Apache-2.0 license.