🚀 WiNGPT2
WiNGPT is a GPT-based large language model for the medical domain. It integrates professional medical knowledge, healthcare information, and data to provide intelligent services such as medical Q&A, diagnostic support, and medical knowledge retrieval for the healthcare industry, improving diagnostic efficiency and the quality of medical services.
🚀 News
WiNGPT (Winning Healthcare's medical large language model, hereinafter WiNGPT) began R&D and training in January 2023.
In March, Winning Healthcare's AI laboratory completed feasibility verification of WiNGPT-001 and started internal testing. WiNGPT-001 used a general GPT architecture with 6 billion parameters and was developed fully in-house, from pre-training to fine-tuning.
By May, WiNGPT-001's training data comprised 9,720 items of drug knowledge, 18 drug types, over 7,200 items of disease knowledge, over 2,800 items of inspection and testing knowledge, knowledge from 53 books, and more than 1,100 guideline documents, for a total of 3.7 billion training tokens.
In July, WiNGPT was upgraded to a 7B model with an updated architecture and new retrieval-augmented generation capabilities; training of the 13B model began, and invitation-based industry testing was launched.
In September, the latest iteration introduced the brand-new WiNGPT2, which can be easily extended and customized for a variety of downstream application scenarios.
To give back to the open-source community, we have open-sourced the WiNGPT2-7B/14B versions, hoping that more open-source projects will accelerate the joint development of medical large language model technology and the industry, ultimately benefiting human health.
✨ Features
Core Functions
- Medical Knowledge Q&A: answers questions about medicine, health, and disease, including but not limited to symptoms, treatments, drugs, prevention, and examinations.
- Natural Language Understanding: understands medical texts such as medical terminology and medical records, providing key-information extraction and classification.
- Multi-turn Dialogue: plays various medical professional roles, such as a doctor, in conversation with users, giving more accurate answers based on the dialogue context.
- Multi-task Support: supports 32 medical tasks covering 18 sub-scenarios across eight major medical scenarios.
Model Architecture
- A Transformer-based large language model with 7B/14B parameters, using RoPE relative position encoding, the SwiGLU activation function, and RMSNorm. Qwen-7B[1] is used as the base pre-trained model for training.
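For readers unfamiliar with these components, below is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. It illustrates the standard published formulations only; the class names, dimensions, and details here are ours, not WiNGPT2's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales by the RMS of the features,
    with a learned gain but no mean-centering (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```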
Main Characteristics
- High Accuracy: trained on a large-scale medical corpus, it delivers high accuracy with a low probability of misdiagnosis.
- Scenario-oriented: specifically optimized and customized for different medical scenarios and real-world needs, easing application deployment.
- Iterative Optimization: continuously collects and learns from the latest medical research to improve model performance and system functionality.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

model_path = "WiNGPT2-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()

generation_config = GenerationConfig(
    num_beams=1,
    top_p=0.75,
    top_k=30,
    repetition_penalty=1.1,
    max_new_tokens=1024,
)

# Prompt format: "User: <message><|endoftext|>\n Assistant: " ("你好" means "hello").
text = 'User: WiNGPT, 你好<|endoftext|>\n Assistant: '
inputs = tokenizer.encode(text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, generation_config=generation_config)
output = tokenizer.decode(outputs[0])
# Strip the prompt string (not the input tensor) to keep only the model's reply.
response = output.replace(text, '')
```
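The snippet assumes the weights have been downloaded to a local `WiNGPT2-7B-Chat` directory; the corresponding Hugging Face hub id should work as well. `trust_remote_code=True` is required because the Qwen-based model ships custom modeling code, and `model.device` keeps the inputs on whatever device the model was loaded to.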
Advanced Usage
WiNGPT2-7B-Chat uses a custom prompt format:
Dialogue roles: User / Assistant
Prompt template: `User: WiNGPT, 你好<|endoftext|>\n Assistant:` (note the single space after `User:` and after `\n`). For multi-turn dialogues, concatenate turns following this template, for example:
`"User: WiNGPT, 你好<|endoftext|>\n Assistant:你好!今天我能为你做些什么?<|endoftext|>\n User: 你是谁?<|endoftext|>\n Assistant:"`
(The Chinese turns read: "WiNGPT, hello" / "Hello! What can I do for you today?" / "Who are you?")
Greedy-search decoding with `repetition_penalty=1.1` is recommended.
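To make the template concrete, here is a small helper that assembles a multi-turn prompt and decodes it greedily with `repetition_penalty=1.1`, reusing the `tokenizer` and `model` from the Basic Usage snippet. The helper itself is our own sketch (the function name and turn structure are illustrative); only the role labels and the `<|endoftext|>\n ` separator come from the template above.

```python
def build_prompt(turns):
    """turns: list of (role, text) pairs, role being "User" or "Assistant".
    Follows the template above; double-check exact spacing against it."""
    prompt = ""
    for role, text in turns:
        prompt += f"{role}: {text}<|endoftext|>\n "
    return prompt + "Assistant:"

prompt = build_prompt([
    ("User", "WiNGPT, 你好"),                     # "WiNGPT, hello"
    ("Assistant", "你好!今天我能为你做些什么?"),   # "Hello! What can I do for you today?"
    ("User", "你是谁?"),                          # "Who are you?"
])
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    num_beams=1,
    do_sample=False,          # greedy search, as recommended above
    repetition_penalty=1.1,
    max_new_tokens=1024,
)
```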
Enterprise Service
A 13B-model platform trial is available (apply directly for an access key).
📚 Documentation
Training Data
Data Overview
- Medical Professional Data

| Dataset | Type | Quantity |
|----------|---------|---------|
| Drug Instructions | Knowledge Base | 15,000 items |
| Multi-disease Knowledge Base | Knowledge Base | 9,720 items |
| Medical Professional Books | Textbooks | 300 books |
| Clinical Pathway Knowledge Base | Knowledge Base | 1,400 items |
| Inspection and Testing Knowledge | Knowledge Base | 1.1 million items |
| Multi-disciplinary Clinical Guidelines | Books | 1,100 documents from 18 departments |
| Medical Knowledge Graph | Knowledge Base | 2.56 million triples |
| Manually Annotated Datasets | Instructions | 50,000 items |
| Medical Qualification Examination Questions | Test Questions | 300,000 items |
| Medical Cases and Reports | Knowledge Base | 1 million items |
- Other Public Data

| Dataset | Type | Quantity |
|----------|---------|---------|
| Medical Popular Science Books | Books | 500 books |
| Other Multi-disciplinary Books | Books | 1,000 books |
| Code | Instructions | 200,000 items |
| General Test Questions | Test Questions | 3 million items |
| Various Natural Language Processing Tasks | Instructions | 900,000 items |
| Internet Texts | Internet | 3 million items |
| Medical Q&A and Dialogues | Instructions | 5 million items |
Continued Pre-training
- Expands the model's medical knowledge base: pre-training data plus part of the instruction data.
Instruction Fine-tuning
- Medical instruction sets are constructed automatically from data such as books, guidelines, cases, medical reports, and knowledge graphs.
- Instruction sets are manually annotated; data sources include electronic medical record systems, nursing record systems, PACS systems, clinical research systems, surgical management systems, public health scenarios, medical management scenarios, and tool-assistant scenarios.
- Schemes such as FastChat[2], Self-Instruct[3], and Evol-Instruct[4] are used to expand the instruction set and enrich its diversity of forms.
Data Engineering
- Data Classification: data are classified by training stage and task scenario.
- Data Cleaning: irrelevant information is removed, spelling errors corrected, key information extracted, and privacy-sensitive content de-identified.
- Data Deduplication: duplicate data are removed with an embedding-based method (a minimal sketch follows this list).
- Data Sampling: targeted sampling according to the quality and distribution requirements of the dataset.
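As a rough illustration of the embedding-based deduplication step, the sketch below drops any text whose embedding is too similar to one already kept. It is our own simplification, not the project's actual pipeline; the sentence-transformers model name and the 0.95 threshold are placeholders.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

def dedupe(texts, threshold=0.95):
    # Placeholder multilingual encoder; any sentence-embedding model works here.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb = encoder.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
    kept = []  # indices of retained texts
    for i in range(len(texts)):
        # Cosine similarity reduces to a dot product on normalized embeddings.
        # O(n^2) overall -- fine for a sketch, not for millions of items.
        if not kept or (emb[kept] @ emb[i]).max() < threshold:
            kept.append(i)
    return [texts[i] for i in kept]
```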
Model Card
Training Configuration and Parameters
| Model | Training Configuration |
|----------|---------|
| WiNGPT2-7B-Base | Length: 2048, precision: bf16, learning rate: 5e-5, weight decay: 0.05, epochs: 3, GPUs: A100×8 |
| WiNGPT2-7B-Chat | Length: 4096, precision: bf16, learning rate: 5e-6, weight decay: 0.01, epochs: 3, GPUs: A100×8 |
Distributed Training Strategy and Parameters
- deepspeed + cpu_offload + zero_stage3
- gradient_checkpointing
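For reference, a DeepSpeed configuration along these lines might look as follows when used through the transformers Trainer integration. This is a minimal sketch with placeholder values, not the project's actual configuration file.

```python
# Minimal ZeRO stage-3 + CPU-offload DeepSpeed config (illustrative values only).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},  # cpu_offload for optimizer states
        "offload_param": {"device": "cpu"},      # cpu_offload for parameters
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# With transformers' Trainer, pass it via TrainingArguments, for example:
# TrainingArguments(..., deepspeed=ds_config, gradient_checkpointing=True, bf16=True)
```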
Evaluation
Chinese Basic Model Evaluation C-EVAL (Zero-shot/Few-shot)
| Model | Average | Average (Hard) | STEM | Social Sciences | Humanities | Others |
|-------|---------|----------------|------|-----------------|------------|--------|
| [bloomz-mt-176B*](https://cevalbenchmark.com/static/model.html?method=bloomz-mt-176B*) | 44.3 | 30.8 | 39.0 | 53.0 | 47.7 | 42.7 |
| [Chinese LLaMA-13B](https://cevalbenchmark.com/static/model.html?method=Chinese%20LLaMA-13B) | 33.3 | 27.3 | 31.6 | 37.2 | 33.6 | 32.8 |
| [ChatGLM-6B*](https://cevalbenchmark.com/static/model.html?method=ChatGLM-6B*) | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| [baichuan-7B](https://cevalbenchmark.com/static/model.html?method=baichuan-7B) | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| [Baichuan-13B](https://cevalbenchmark.com/static/model.html?method=Baichuan-13B) | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 |
| [Qwen-7B](https://cevalbenchmark.com/static/model.html?method=Qwen-7B) | 59.6 | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |
| [WiNGPT2-7B-Base](https://huggingface.co/winninghealth/WiNGPT2-7B-Base) | 57.4 | 42.7 | 53.2 | 69.7 | 55.7 | 55.4 |
Chinese Medical Professional Evaluation MedQA-MCMLE (Zero-shot)
| Model | Average | Hematological Diseases | Metabolic and Endocrine System Diseases | Mental and Nervous System Diseases | Musculoskeletal Diseases | Rheumatological and Immunological Diseases | Pediatric Diseases | Infectious and Sexually Transmitted Diseases | Other Diseases |
|-------|---------|------------------------|------------------------------------------|--------------------------------------|---------------------------|----------------------------------------------|--------------------|-----------------------------------------------|----------------|
| [Baichuan-7B](https://huggingface.co/baichuan-inc/Baichuan-7B) | 23.1 | 25.6 | 20.2 | 25.8 | 17.9 | 26.5 | 20.6 | 26.1 | 17.1 |
| [Baichuan-13B-Base](https://huggingface.co/baichuan-inc/Baichuan-13B-Base) | 37.2 | 34.4 | 36.2 | 40.7 | 38.4 | 57.1 | 31.6 | 30.8 | 34.3 |
| [Baichuan2-7B-Base](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base) | 46.4 | 46.9 | 41.4 | 53.8 | 48.3 | 50.0 | 38.6 | 52.7 | 42.9 |
| [Baichuan2-13B-Base](https://huggingface.co/baichuan-inc/Baichuan2-13B-Base) | 62.9 | 68.8 | 64.4 | 69.7 | 64.9 | 60.3 | 50.9 | 61.2 | 62.9 |
| [HuatuoGPT-7B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-7B) | 22.9 | 14.6 | 17.2 | 31.2 | 25.8 | 14.3 | 22.4 | 23.1 | 17.1 |
| [MedicalGPT](https://huggingface.co/shibing624/vicuna-baichuan-13b-chat) | 17.9 | 21.9 | 15.5 | 19.5 | 9.3 | 7.1 | 16.7 | 20.9 | 9.5 |
| [Qwen-7B](https://huggingface.co/Qwen/Qwen-7B) | 59.3 | 55.2 | 56.9 | 57.0 | 60.9 | 60.3 | 50.4 | 60.4 | 61.0 |
| [WiNGPT2-7B-Base](https://huggingface.co/winninghealth/WiNGPT2-7B-Base) | 82.3 | 83.3 | 82.8 | 86.0 | 81.5 | 85.7 | 75.1 | 78.0 | 80.0 |
**Currently, public evaluations have certain limitations, and the results are for reference only; more professional evaluations are forthcoming.**
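For context on how zero-shot results like these are typically produced, the model is prompted with a question and its answer options, and the predicted letter is parsed from its reply. The sketch below is our own simplification, not the official C-EVAL or MedQA-MCMLE harness; the prompt wording and answer extraction are placeholders.

```python
import re

def answer_mcq(model, tokenizer, question, options):
    """Zero-shot multiple choice: ask for a letter and parse it from the reply."""
    letters = "ABCD"
    opts = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    # "请直接回答选项字母" = "answer with the option letter only"
    prompt = f"User: {question}\n{opts}\n请直接回答选项字母。<|endoftext|>\n Assistant:"
    ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    out = model.generate(ids, do_sample=False, max_new_tokens=8)
    reply = tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    match = re.search(r"[ABCD]", reply)
    return match.group(0) if match else None
```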
Limitations and Disclaimer
(a) WiNGPT2 is a large language model in the professional medical field. It can provide general users with anthropomorphic AI doctor consultations and Q&A functions, as well as knowledge Q&A in the general medical field. For professional medical personnel, the answers and suggestions provided by WiNGPT2 regarding patient diagnosis, medication, and health advice are for reference only.
(b) You should understand that WiNGPT2 only provides information and suggestions and cannot replace the opinions, diagnoses, or treatment suggestions of medical professionals. Before using the information from WiNGPT2, please seek advice from doctors or other medical professionals and independently evaluate the provided information.
(c) The information in WiNGPT2 may contain errors or inaccuracies. Winning Healthcare does not provide any express or implied warranties regarding the accuracy, reliability, completeness, quality, safety, timeliness, performance, or applicability of WiNGPT2. You are solely responsible for the results and decisions made using WiNGPT2. Winning Healthcare shall not be liable for any damages caused by third-party reasons.
📄 License
- This project is licensed under the Apache License 2.0. The model weights must also comply with the relevant agreements and [license](https://github.com/QwenLM/Qwen-7B/blob/main/LICENSE) of the base model [Qwen-7B](https://github.com/QwenLM/Qwen-7B); refer to its repository for details.
- When using this project, including the model weights, please cite this project: https://github.com/winninghealth/WiNGPT2
References
1. https://github.com/QwenLM/Qwen-7B
2. https://github.com/lm-sys/FastChat
3. https://github.com/yizhongw/self-instruct
4. https://github.com/nlpxucan/evol-instruct
Contact Us
Website: https://www.winning.com.cn
Email: wair@winning.com.cn