🚀 WiNGPT2
WiNGPT2 is a GPT-based large model designed specifically for the medical vertical. It integrates professional medical knowledge, information, and data to provide the healthcare industry with intelligent services such as medical Q&A, diagnostic support, and medical knowledge retrieval, thereby improving diagnostic efficiency and the quality of medical services.
🚀 Introduction
WiNGPT (Winning Health's large medical language model, hereinafter referred to as WiNGPT) began R&D and training in January 2023.
In March, Winning Health's AI laboratory completed the feasibility verification of WiNGPT-001 and started internal testing. WiNGPT-001 uses a general GPT architecture with 6 billion parameters and was developed fully in-house, from pre-training through fine-tuning.
By May, the training data for WiNGPT-001 covered 9,720 items of drug knowledge across 18 drug types, more than 7,200 items of disease knowledge, more than 2,800 items of examination and inspection knowledge, knowledge from 53 books, and more than 1,100 guideline documents, for a total of 3.7 billion training tokens.
In July, WiNGPT was upgraded to 7B parameters and adopted the latest model architecture, added retrieval-augmented generation capability, and began 13B model training and invitation-based industry testing.
In September, WiNGPT reached its latest iteration with the release of the all-new WiNGPT2, which can be easily extended and personalized for a wide range of downstream application scenarios.
To give back to the open-source community, we have open-sourced the WiNGPT2-7B version. Our goal is to accelerate the joint development of medical large language model technology and the industry through more open-source projects, and ultimately to benefit human health.
✨ Features
Core Functions
- Medical Knowledge Q&A: It can answer questions about medicine, health, diseases, etc., including but not limited to symptoms, treatments, drugs, prevention, and examinations.
- Natural Language Understanding: It can understand medical text such as medical terminology and medical records, and extract and classify key information.
- Multi-turn Dialogue: It can play various medical professional roles, such as a doctor, in conversations with users, and give more accurate answers based on context.
- Multi-task Support: It supports 32 medical tasks and 18 sub-scenarios across 8 major medical scenarios.
Model Architecture
- A Transformer-based large language model with 7 billion parameters, using RoPE relative position encoding, the SwiGLU activation function, and RMSNorm. Training starts from Qwen-7B as the base pre-trained model.
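For readers unfamiliar with the components named above, here is a minimal PyTorch sketch of RMSNorm. It is illustrative only, not the model's actual implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the features,
    with no mean-centering (unlike LayerNorm)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```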
Main Characteristics
- High Accuracy: Trained on a large-scale medical corpus, the model achieves high accuracy and a low likelihood of misdiagnosis.
- Scenario-oriented: Optimized and customized for different medical scenarios and real-world needs, to better support application deployment.
- Iterative Optimization: We continuously collect and learn from the latest medical research to improve model performance and system functionality.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

model_path = "WiNGPT2-7B-Chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model = model.eval()

generation_config = GenerationConfig(
    num_beams=1,
    top_p=0.75,
    top_k=30,
    repetition_penalty=1.1,
    max_new_tokens=1024
)

text = 'User: WiNGPT, 你好<|endoftext|>\n Assistant: '
inputs = tokenizer.encode(text, return_tensors="pt").to(device)
outputs = model.generate(inputs, generation_config=generation_config)
output = tokenizer.decode(outputs[0])
# Strip the prompt so only the model's reply remains.
response = output.replace(text, '')
```
Advanced Usage
WiNGPT2-7B-Chat uses a custom prompt format:
Chat roles: User / Assistant
Prompt template: `User: WiNGPT, 你好<|endoftext|>\n Assistant:` (note the single space after `User:` and after `\n`, before `Assistant:`). For multi-turn dialogue, concatenate turns according to this template, for example:
"User: WiNGPT, 你好<|endoftext|>\n Assistant:你好!今天我能为你做些什么?<|endoftext|>\n User: 你是谁?<|endoftext|>\n Assistant:"
Greedy search (num_beams=1) with repetition_penalty=1.1 is recommended for decoding. A small helper for splicing multi-turn conversations is sketched below.
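A minimal helper, written against the template above (a sketch, not part of the released code):

```python
def build_prompt(turns):
    """Splice (role, text) pairs, role being "User" or "Assistant", into the
    WiNGPT2 prompt template; ends with "Assistant:" to request the next reply."""
    prompt = ""
    for role, text in turns:
        prompt += f"{role}: {text}<|endoftext|>\n "
    return prompt + "Assistant:"

# Example: reproduces the multi-turn string shown above.
history = [
    ("User", "WiNGPT, 你好"),
    ("Assistant", "你好!今天我能为你做些什么?"),
    ("User", "你是谁?"),
]
print(build_prompt(history))
```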
Enterprise Service
13B model platform testing is available (apply directly for an access key).
📚 Documentation
Training Data
Data Overview
Medical Professional Data
| Data Source | Data Type |
| --- | --- |
| Drug Instructions | Knowledge Base |
| Multi-disease Knowledge Base | Knowledge Base |
| Medical Professional Books | Textbooks |
| Clinical Pathway Knowledge Base | Knowledge Base |
| Examination and Inspection Knowledge | Knowledge Base |
| Multi-disciplinary Clinical Guidelines | Books |
| Medical Knowledge Graph | Knowledge Base |
| Manually Annotated Datasets | Instructions |
| Medical Qualification Examination Questions | Test Questions |
| Medical Cases and Reports | Knowledge Base |
Other Public Data
| Data Source | Data Type |
| --- | --- |
| Medical Popular Science Books | Books |
| Other Multi-disciplinary Books | Books |
| Code | Instructions |
| General Test Questions | Test Questions |
| Multiple Natural Language Processing Tasks | Instructions |
| Internet Texts | Internet |
| Medical Q&A and Dialogues | Instructions |
Continued Pre-training
- Expand the model's medical knowledge base: pre-training data plus part of the instruction data.
Instruction Fine-tuning
- Automatically construct a medical instruction set from books, guidelines, cases, medical reports, and knowledge graphs.
- Manually annotate instruction data. Sources include electronic medical record systems, nursing record systems, PACS systems, clinical research systems, surgical management systems, public health scenarios, medical affairs management scenarios, and tool-assistant scenarios.
- Use approaches such as FastChat, Self-Instruct, and Evol-Instruct to expand the instruction set and enrich its diversity (an illustrative sketch follows this list).
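For illustration, a sketch of an Evol-Instruct-style expansion step. The prompt wording and the `generate` callable are assumptions, not the project's actual pipeline:

```python
# Illustrative Evol-Instruct-style rewriting prompt for growing an instruction set.
EVOLVE_TEMPLATE = (
    "Rewrite the following medical instruction to be more complex and specific, "
    "while keeping it answerable and factually grounded.\n"
    "Original instruction: {instruction}\n"
    "Rewritten instruction:"
)

def evolve(instruction: str, generate) -> str:
    """`generate` is any text-generation callable (e.g., an LLM API wrapper)."""
    return generate(EVOLVE_TEMPLATE.format(instruction=instruction))
```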
Data Engineering
- Data Classification: Classify data by training stage and task scenario.
- Data Cleaning: Remove irrelevant information, correct spelling errors, extract key information, and de-identify private data.
- Data Deduplication: Remove duplicate data using embeddings (see the sketch below).
- Data Sampling: Sample purposefully according to the dataset's quality and distribution requirements.
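As an illustration of the deduplication step, a minimal sketch assuming an `embed(texts)` helper that returns one vector per text (any sentence-embedding model would do):

```python
import numpy as np

def dedup_by_embedding(texts, embed, threshold=0.95):
    """Keep a text only if its cosine similarity to every already-kept text
    is below `threshold`. O(n^2) pairwise scan: fine as a sketch, not at scale."""
    vecs = embed(texts)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    kept, kept_vecs = [], []
    for text, v in zip(texts, vecs):
        if all(float(v @ kv) < threshold for kv in kept_vecs):
            kept.append(text)
            kept_vecs.append(v)
    return kept
```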
Model Card
Training Configuration and Parameters
| Model | Training Configuration |
| --- | --- |
| WiNGPT2-7B-Base | Length: 2048, precision: bf16, learning rate: 5e-5, weight decay: 0.05, epochs: 3, GPUs: 8× A100 |
| WiNGPT2-7B-Chat | Length: 4096, precision: bf16, learning rate: 5e-6, weight decay: 0.01, epochs: 3, GPUs: 8× A100 |
Distributed Training Strategy and Parameters
- deepspeed + cpu_offload + zero_stage3 (an illustrative configuration sketch follows this list)
- gradient_checkpointing
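For reference, a minimal DeepSpeed configuration consistent with this strategy, written as a Python dict. The values are assumptions, not the project's actual training file:

```python
# Illustrative DeepSpeed settings: ZeRO stage 3 with CPU offload, bf16 precision.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
}
# Gradient checkpointing is enabled on the model side, e.g. with
# model.gradient_checkpointing_enable() in Hugging Face Transformers.
```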
Evaluation
Chinese Basic Model Evaluation: C-EVAL (Zero-shot/Few-shot)

| Model | Average | Average (Hard) | STEM | Social Sciences | Humanities | Others |
| --- | --- | --- | --- | --- | --- | --- |
| [bloomz-mt-176B](https://cevalbenchmark.com/static/model.html?method=bloomz-mt-176B*) | 44.3 | 30.8 | 39 | 53 | 47.7 | 42.7 |
| [Chinese LLaMA-13B](https://cevalbenchmark.com/static/model.html?method=Chinese%20LLaMA-13B) | 33.3 | 27.3 | 31.6 | 37.2 | 33.6 | 32.8 |
| [ChatGLM-6B*](https://cevalbenchmark.com/static/model.html?method=ChatGLM-6B*) | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38 |
| [baichuan-7B](https://cevalbenchmark.com/static/model.html?method=baichuan-7B) | 42.8 | 31.5 | 38.2 | 52 | 46.2 | 39.3 |
| [Baichuan-13B](https://cevalbenchmark.com/static/model.html?method=Baichuan-13B) | 53.6 | 36.7 | 47 | 66.8 | 57.3 | 49.8 |
| [Qwen-7B](https://cevalbenchmark.com/static/model.html?method=Qwen-7B) | 59.6 | 41 | 52.8 | 74.1 | 63.1 | 55.2 |
| [WiNGPT2-7B-Base](https://huggingface.co/winninghealth/WiNGPT2-7B-Base) | 57.4 | 42.7 | 53.2 | 69.7 | 55.7 | 55.4 |
Chinese Medical Professional Evaluation: MedQA-MCMLE (Zero-shot)

| Model Name | Average | Hematological Diseases | Metabolic and Endocrine System Diseases | Mental and Nervous System Diseases | Musculoskeletal Diseases | Rheumatic and Immune Diseases | Pediatric Diseases | Infectious and Sexually Transmitted Diseases | Other Diseases |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [Baichuan-7B](https://huggingface.co/baichuan-inc/Baichuan-7B) | 23.1 | 25.6 | 20.2 | 25.8 | 17.9 | 26.5 | 20.6 | 26.1 | 17.1 |
| [Baichuan-13B-Base](https://huggingface.co/baichuan-inc/Baichuan-13B-Base) | 37.2 | 34.4 | 36.2 | 40.7 | 38.4 | 57.1 | 31.6 | 30.8 | 34.3 |
| [Baichuan2-7B-Base](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base) | 46.4 | 46.9 | 41.4 | 53.8 | 48.3 | 50.0 | 38.6 | 52.7 | 42.9 |
| [Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base) | 62.9 | 68.8 | 64.4 | 69.7 | 64.9 | 60.3 | 50.9 | 61.2 | 62.9 |
| [HuatuoGPT-7B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-7B) | 22.9 | 14.6 | 17.2 | 31.2 | 25.8 | 14.3 | 22.4 | 23.1 | 17.1 |
| [MedicalGPT](https://huggingface.co/shibing624/vicuna-baichuan-13b-chat) | 17.9 | 21.9 | 15.5 | 19.5 | 9.3 | 7.1 | 16.7 | 20.9 | 9.5 |
| [Qwen-7B](https://huggingface.co/Qwen/Qwen-7B) | 59.3 | 55.2 | 56.9 | 57.0 | 60.9 | 60.3 | 50.4 | 60.4 | 61.0 |
| [WiNGPT2-7B-Base](https://huggingface.co/winninghealth/WiNGPT2-7B-Base) | 82.3 | 83.3 | 82.8 | 86.0 | 81.5 | 85.7 | 75.1 | 78.0 | 80.0 |
⚠️ Important Note
The current public evaluations have certain limitations, and the results are for reference only. More professional evaluations are coming soon.
🔧 Technical Details
WiNGPT2 is a large language model for the professional medical field. It offers general users human-like AI doctor consultation and Q&A, as well as general medical knowledge Q&A. For medical professionals, the answers and suggestions WiNGPT2 provides regarding patient diagnosis, medication, and health advice are for reference only.
You should understand that WiNGPT2 only provides information and suggestions and cannot replace the opinions, diagnoses, or treatment recommendations of medical professionals. Before acting on any information from WiNGPT2, seek the advice of a doctor or other medical professional and evaluate the information independently.
The information provided by WiNGPT2 may be incorrect or inaccurate. Winning Health makes no express or implied guarantees regarding the accuracy, reliability, completeness, quality, safety, timeliness, performance, or fitness for purpose of WiNGPT2. You bear full responsibility for any results and decisions arising from your use of WiNGPT2, and Winning Health is not liable for any damages caused to you by third parties.
📄 License
- This project is licensed under the Apache License 2.0. The model weights must additionally comply with the agreements and [license](https://github.com/QwenLM/Qwen-7B/blob/main/LICENSE) of the base model [Qwen-7B](https://github.com/QwenLM/Qwen-7B); see its website for details.
- Please cite this project when using it, including the model weights: https://github.com/winninghealth/WiNGPT2
🔗 References
- https://github.com/QwenLM/Qwen-7B
- https://github.com/lm-sys/FastChat
- https://github.com/yizhongw/self-instruct
- https://github.com/nlpxucan/evol-instruct
📞 Contact Us
Website: https://www.winning.com.cn
Email: wair@winning.com.cn