GENIE_zh_7b Open-source Medical Text Processing Model - Free to Extract Biomedical Entities and Attributes from Medical Records

GENIE Zh 7b

Developed by THUMedInfo

GENIE is an end-to-end model designed to structure free text in Electronic Health Records (EHR), extracting biomedical named entities and their related attributes.

Large Language Model

Safetensors

Language: Chinese
License: Apache-2.0
Tags: Electronic Medical Record Structuring, Medical Information Extraction, End-to-End Processing

Downloads 76

Release Time: 11/19/2024

Model Overview

GENIE processes EHR in a single pass to extract biomedical named entities along with their assertion status, body parts, modifiers, values, units, and intended uses, outputting this information in a structured JSON format.

Model Features

End-to-End Processing

Replaces all analysis components with a single model, simplifying traditional NLP workflows.

No Prompt Engineering

Unlike general-purpose LLMs, GENIE does not require prompt engineering or few-shot examples.

Efficient Processing

Generates all relevant attributes in a single pass, significantly reducing runtime and operational costs.

Structured Output

Outputs extracted information in a structured JSON format for easy subsequent processing and analysis.

Model Capabilities

Electronic Health Record Structuring

Biomedical Named Entity Recognition

Assertion Status Extraction

Body Part Recognition

Modifier Extraction

Value and Unit Extraction

Intended Use Extraction

Use Cases

Healthcare

Electronic Health Record Analysis

Extracts structured information from free-text electronic health records, such as diseases, symptoms, diagnostic procedures, etc.

Outputs structured JSON containing terms, semantic types, assertion status, body parts, values, units, and modifiers.

Clinical Research Data Preparation

Prepares structured patient data for clinical research, facilitating statistical analysis and machine learning model training.

Provides high-quality annotated data, reducing manual labeling costs.
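
As a rough illustration of this data-preparation step, the sketch below flattens per-note entity lists into CSV rows using only the Python standard library. The note identifiers, the column order, and the function name `entities_to_csv` are assumptions made for this example. The field names follow the Chinese keys in the example output shown in this model card, and missing fields are filled with "NA" to match the placeholder used there.

```python
import csv
import io

# Column order is an assumption; keys follow the model's example output.
FIELDS = ["术语", "语义类型", "叙述状态", "身体部位", "数值", "单位", "修饰词"]

def entities_to_csv(notes):
    """notes: dict mapping a note id to its list of entity dicts. Returns CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["note_id"] + FIELDS,
        restval="NA",            # fill fields the entity does not provide
        extrasaction="ignore",   # drop keys outside FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    for note_id, entities in notes.items():
        for entity in entities:
            writer.writerow({"note_id": note_id, **entity})
    return buffer.getvalue()

# Hypothetical note id and a shortened entity record.
csv_text = entities_to_csv({
    "note-001": [{"术语": "肝纤维化", "叙述状态": "存在"}],
})
```

One row per entity keeps the table easy to load into statistical tools; a per-patient wide format would need an extra aggregation step.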

🚀 Model Card for GENIE

GENIE (Generative Note Information Extraction) is an end-to-end model tailored for structuring free text from electronic health records (EHRs). It simplifies the traditional NLP workflow and reduces runtime and operational costs.

🚀 Quick Start

✨ Features

Single-pass processing: GENIE processes EHRs in one go, extracting various biomedical information and outputting it in a structured JSON format.
No prompt engineering needed: Unlike general-purpose LLMs, it doesn't require prompt engineering or few-shot examples.
Cost-effective: It significantly cuts down both runtime and operational costs by generating all relevant attributes in a single pass.


💻 Usage Examples

Basic Usage

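from vllm import LLM, SamplingParams

# Assumption: Hugging Face repo id; a local path to the weights also works.
model = LLM(model="THUMedInfo/GENIE_zh_7b")
# Assumed decoding settings; tune for your workload.
temperature = 0.0
max_new_token = 2048

PROMPT_TEMPLATE = "Human:\n{query}\n\n Assistant:\n"
sampling_params = SamplingParams(temperature=temperature, max_tokens=max_new_token)
EHR = ['xxxxx1', 'xxxxx2']  # placeholder notes; replace with real EHR text
texts = [PROMPT_TEMPLATE.format(query=k) for k in EHR]
output = model.generate(texts, sampling_params)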
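
The `generate` call returns one result object per prompt; in vLLM the generated string is available as `output[i].outputs[0].text`. A minimal sketch of turning that string into Python objects follows. It assumes the model emits a JSON array of entity objects; the exact raw format is not specified here, so the helper `parse_entities` is hypothetical and returns an empty list when parsing fails.

```python
import json

def parse_entities(generated_text):
    """Parse a generation into a list of dicts; return [] if it is not valid JSON."""
    # Keep only the span between the first '[' and the last ']', in case the
    # model wraps the array in extra text (an assumption, not documented behavior).
    start = generated_text.find("[")
    end = generated_text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(generated_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entity objects.
    return [item for item in parsed if isinstance(item, dict)]

# Stand-in for `output[i].outputs[0].text` from the snippet above.
sample_text = '[{"术语": "肝纤维化", "叙述状态": "存在"}]'
entities = parse_entities(sample_text)
```

Returning an empty list on failure keeps batch jobs running; logging the failed note for review would be a sensible addition in practice.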

Advanced Usage

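Here is an example of input and output. The input is a Chinese clinical note, roughly: "History of chronic hepatitis B for more than 10 years, with previously abnormal liver function that improved after traditional Chinese medicine treatment; HBsAg turned negative more than a year ago, but liver pathology indicated viral hepatitis with liver fibrosis (G1S3-4)." The output keys are 术语 (term), 语义类型 (semantic type), 叙述状态 (assertion status), 身体部位 (body part), 数值 (value), 单位 (unit), and 修饰词 (modifier).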

Input:

EHR = ['慢性乙型肝炎病史10余年，曾有肝功能异常，中医治疗后好转；1年余前查HBsAg转阴，但肝脏病理提示病毒性肝炎伴肝纤维化（G1S3-4）']

Output:

res = [
 { "术语": "慢性乙型肝炎",
 "语义类型": "疾病、综合征、病理功能",
 "叙述状态": "存在",
 "身体部位": "无",
 "数值": "NA",
 "单位": "NA",
 "修饰词": "无" },
 { "术语": "肝功能异常",
 "语义类型": "症状、体征、临床所见",
 "叙述状态": "存在",
 "身体部位": "无",
 "数值": "NA",
 "单位": "NA",
 "修饰词": "无" },
 { "术语": "HBsAg",
 "语义类型": "化学物质、药物",
 "叙述状态": "不存在",
 "身体部位": "NA",
 "数值": "无",
 "单位": "NA",
 "修饰词": "NA" },
 { "术语": "肝脏病理",
 "语义类型": "诊断操作",
 "叙述状态": "存在",
 "身体部位": "无",
 "数值": "无",
 "单位": "NA",
 "修饰词": "NA" },
 { "术语": "病毒性肝炎",
 "语义类型": "疾病、综合征、病理功能",
 "叙述状态": "存在",
 "身体部位": "无",
 "数值": "NA",
 "单位": "NA",
 "修饰词": "无" },
 { "术语": "肝纤维化",
 "语义类型": "疾病、综合征、病理功能",
 "叙述状态": "存在",
 "身体部位": "无",
 "数值": "NA",
 "单位": "NA",
 "修饰词": "无" },
]
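
Downstream code often needs only a subset of these records. The sketch below assumes the Chinese field names and the status value "存在" (present) shown in the example output. The English key names and the helpers `to_english_keys` and `present_entities` are illustrative choices, not part of GENIE.

```python
# Assumed mapping from the Chinese output keys to English names (illustrative).
KEY_MAP = {
    "术语": "term",
    "语义类型": "semantic_type",
    "叙述状态": "assertion_status",
    "身体部位": "body_part",
    "数值": "value",
    "单位": "unit",
    "修饰词": "modifier",
}

def to_english_keys(entity):
    """Rename known keys; unknown keys are kept unchanged."""
    return {KEY_MAP.get(key, key): value for key, value in entity.items()}

def present_entities(entities):
    """Keep entities whose assertion status is "存在" (present)."""
    return [to_english_keys(e) for e in entities if e.get("叙述状态") == "存在"]

# Two shortened records taken from the example output above.
sample = [
    {"术语": "慢性乙型肝炎", "语义类型": "疾病、综合征、病理功能", "叙述状态": "存在"},
    {"术语": "HBsAg", "语义类型": "化学物质、药物", "叙述状态": "不存在"},
]
present = present_entities(sample)
```

Only the keys are renamed; the values stay in Chinese, so any value-level mapping (for example of semantic types) would need its own lookup table.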

📚 Documentation

Model Details

Property	Details
Model Size	7B
Max Tokens	8192
Base model	Qwen 2.5 7B
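
Because the limit is 8192 tokens, long notes may need to be split before inference. The sketch below splits on sentence-ending punctuation and packs sentences into chunks under a character budget. The default budget of 2000 characters is an arbitrary stand-in, since characters are only a rough proxy for tokens; counting with the model's tokenizer would be more accurate. `split_note` is a hypothetical helper, not part of GENIE.

```python
import re

def split_note(note, max_chars=2000):
    """Pack sentences into chunks of at most max_chars characters where possible.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after Chinese/ASCII sentence-ending punctuation, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[。；;！？!?])", note) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

chunks = split_note("病史10年。肝功能异常；治疗后好转。", max_chars=10)
```

Splitting can separate an entity from context that determines its attributes, so chunk boundaries at section or paragraph breaks are likely preferable when the note provides them.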

Model Description

GENIE (Generative Note Information Extraction, Chinese name: 病历精灵) is an end-to-end model designed to structure free text from electronic health records (EHRs). It processes EHRs in a single pass, extracting biomedical named entities along with their assertion statuses, body locations, modifiers, values, units, and intended purposes, and outputs this information in a structured JSON format. This approach simplifies traditional natural language processing workflows by replacing all the analysis components with a single model, which makes the system easier to maintain while leveraging the analytical capabilities of large language models (LLMs). Compared with general-purpose LLMs, GENIE does not require prompt engineering or few-shot examples. It also generates all relevant attributes in one pass, significantly reducing both runtime and operational costs.

GENIE is co-developed by the groups of Sheng Yu (https://www.stat.tsinghua.edu.cn/teachers/shengyu/), Tianxi Cai (https://dbmi.hms.harvard.edu/people/tianxi-cai), and Isaac Kohane (https://dbmi.hms.harvard.edu/people/isaac-kohane).

📄 License

The model is licensed under the Apache-2.0 license.

📚 Citation

If you find our paper or models helpful, please consider citing:

BibTeX:

@misc{ying2025geniegenerativenoteinformation,
      title={GENIE: Generative Note Information Extraction model for structuring EHR data}, 
      author={Huaiyuan Ying and Hongyi Yuan and Jinsen Lu and Zitian Qu and Yang Zhao and Zhengyun Zhao and Isaac Kohane and Tianxi Cai and Sheng Yu},
      year={2025},
      eprint={2501.18435},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.18435}, 
}
