ViHealthBERT Open-source Language Model - Free for Vietnamese Medical and Health Text Mining

Vihealthbert Base Word

Developed by demdecuong

ViHealthBERT is a pre-trained language model for Vietnamese health text mining, providing strong baseline performance in the healthcare domain

Large Language Model

Transformers

#Vietnamese medical text processing #Pre-trained language models #Named entity recognition

Downloads 633

Release Time: 4/20/2022

Model Overview

A pre-trained language model specifically designed for Vietnamese medical and health texts, supporting tasks such as named entity recognition, abbreviation disambiguation, and text summarization

Model Features

Medical domain optimization

Pre-trained specifically for Vietnamese medical and health texts, targeting downstream tasks in that domain

Dual tokenizer support

Provides both word-level and syllable-level tokenizer versions to adapt to different application scenarios

Accompanying datasets

Released together with a medical abbreviation dataset (acrDrAid) and an FAQ summarization dataset

Model Capabilities

Vietnamese medical text understanding

Named entity recognition

Abbreviation disambiguation

Text summarization

Use Cases

Medical information processing

COVID-19 entity recognition

Identifying COVID-19 related entities from Vietnamese medical texts

Achieved SOTA performance on the COVID-19 and ViMQ datasets

Medical abbreviation resolution

Resolving professional abbreviations in Vietnamese medical documents

Reported SOTA performance on acronym disambiguation with the acrDrAid dataset

Medical text summarization

FAQ summarization

Generating concise summaries of frequently asked medical questions in Vietnamese

🚀 ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain. The accompanying work empirically explores different training strategies and reports state-of-the-art (SOTA) performance on three downstream tasks: Named Entity Recognition (NER) on COVID-19 and ViMQ, Acronym Disambiguation, and Summarization.

This project also introduces two Vietnamese datasets: the acronym dataset (acrDrAid) and the FAQ summarization dataset in the healthcare domain. The acrDrAid dataset is annotated with 135 sets of keywords.

The general approaches and experimental results of ViHealthBERT can be found in our LREC-2022 Poster paper (updated soon):

@article{vihealthbert,
    title     = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
    author    = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
    journal   = {13th Edition of the Language Resources and Evaluation Conference},
    year      = {2022}
}

🚀 Quick Start

📦 Installation

Requires Python 3.6+ and PyTorch >= 1.6.
Install transformers:

pip install transformers==4.2.0
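
Before loading the model it can help to confirm what is actually installed. The following is a minimal sketch, not part of the original instructions: `installed_version` is a hypothetical helper, and `importlib.metadata` is assumed to be available (it is in the standard library from Python 3.8 onward, which is newer than the stated 3.6 minimum).

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the interpreter and the two packages named in the requirements above.
print("python:", ".".join(map(str, sys.version_info[:3])))
for name in ("torch", "transformers"):
    print(name + ":", installed_version(name) or "not installed")
```

Compare the printed versions by eye against the requirements listed above; the sketch deliberately does not parse or enforce version constraints.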

📚 Pre-trained models

Property	Details
Model	`demdecuong/vihealthbert-base-word`, `demdecuong/vihealthbert-base-syllable`
#params	135M
Arch.	base
Tokenizer	Word-level, Syllable-level
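
The two checkpoints in the table differ in tokenizer granularity. As a minimal sketch, assuming the `-word` and `-syllable` suffixes correspond to the word-level and syllable-level tokenizers respectively (inferred from the names), a small lookup keeps the choice explicit in downstream code. `CHECKPOINTS` and `pick_checkpoint` are hypothetical names introduced only for illustration.

```python
# Hypothetical helper: map a tokenizer granularity to a checkpoint name from the table.
CHECKPOINTS = {
    "word": "demdecuong/vihealthbert-base-word",
    "syllable": "demdecuong/vihealthbert-base-syllable",
}


def pick_checkpoint(granularity):
    """Return the checkpoint name for 'word' or 'syllable' (case-insensitive)."""
    key = granularity.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError("granularity must be one of: " + ", ".join(sorted(CHECKPOINTS)))
    return CHECKPOINTS[key]


print(pick_checkpoint("word"))
```

The returned string can be passed to `AutoModel.from_pretrained` and `AutoTokenizer.from_pretrained`.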

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoModel, AutoTokenizer

vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word")
tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word")

# Input text must already be word-segmented (multi-syllable words joined by "_").
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = vihealthbert(input_ids)  # features[0] holds the last-layer hidden states

Advanced Usage

Since ViHealthBERT used the RDRSegmenter from VnCoreNLP to pre-process the pre-training data, we highly recommend using the same word-segmenter for ViHealthBERT downstream applications.

Installation

# Install the vncorenlp python wrapper
pip3 install vncorenlp

# Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter) 
mkdir -p vncorenlp/models/wordsegmenter
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
mv VnCoreNLP-1.1.1.jar vncorenlp/ 
mv vi-vocab vncorenlp/models/wordsegmenter/
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/

Note that VnCoreNLP-1.1.1.jar (27MB) and the models/ folder must be placed in the same working folder.
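
A quick way to confirm the layout produced by the commands above is to check for the three downloaded files. This is a minimal sketch using only the standard library; `REQUIRED` and `missing_vncorenlp_files` are hypothetical names, and the folder name `vncorenlp` follows the `mkdir` command above.

```python
from pathlib import Path

# Files that the shell commands above place under the vncorenlp/ folder.
REQUIRED = (
    "VnCoreNLP-1.1.1.jar",
    "models/wordsegmenter/vi-vocab",
    "models/wordsegmenter/wordsegmenter.rdr",
)


def missing_vncorenlp_files(root):
    """Return the required paths (relative to root) that are not regular files."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).is_file()]


missing = missing_vncorenlp_files("vncorenlp")
print("missing:", missing if missing else "none")
```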

Example usage

# See more details at: https://github.com/vncorenlp/VnCoreNLP

# Load rdrsegmenter from VnCoreNLP
from vncorenlp import VnCoreNLP
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m') 

# Input 
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

# To perform word (and sentence) segmentation
sentences = rdrsegmenter.tokenize(text) 
for sentence in sentences:
    print(" ".join(sentence))
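
The segmenter returns one token list per sentence, with multi-syllable words joined by underscores, which is the form the word-level model expects. The sketch below shows one way to connect the two steps. `join_tokens` and `segment_and_encode` are hypothetical helpers: they accept any segmenter exposing `.tokenize(text)` (such as `rdrsegmenter`) and any tokenizer exposing `.encode(str)` (such as the `AutoTokenizer` loaded earlier). The sample token list is illustrative, not actual segmenter output.

```python
def join_tokens(sentences):
    """Turn [[token, ...], ...] into one space-joined string per sentence."""
    return [" ".join(tokens) for tokens in sentences]


def segment_and_encode(text, segmenter, tokenizer):
    """Word-segment text, then encode each sentence into a list of input ids.

    segmenter: object with .tokenize(text) returning a list of token lists.
    tokenizer: object with .encode(str) returning a list of ids.
    """
    lines = join_tokens(segmenter.tokenize(text))
    return [tokenizer.encode(line) for line in lines]


# Token lists in the shape the segmenter returns (values are illustrative).
example = [["Bà", "Lan", "làm_việc", "tại", "đây", "."]]
print(join_tokens(example))
```

With the real objects, `segment_and_encode(text, rdrsegmenter, tokenizer)` would yield one id list per sentence, each of which can be wrapped in `torch.tensor([...])` as in the basic usage example.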
