🚀 BioRedditBERT
BioRedditBERT is a BERT-based model pre-trained on health-related Reddit posts, offering improved performance on medical entity linking in the social media domain.
🚀 Quick Start
For a detailed understanding of BioRedditBERT, please refer to our paper *COMETA: A Corpus for Medical Entity Linking in the Social Media* (EMNLP 2020).
✨ Features
BioRedditBERT is initialised from BioBERT (BioBERT-Base v1.0 + PubMed 200K + PMC 270K) and further pre-trained on health-related Reddit posts, which enables it to better handle medical entity linking in the social media domain.
📦 Installation
There are no bespoke installation steps: BioRedditBERT is a standard BERT checkpoint that can be loaded through the Hugging Face `transformers` library (`pip install transformers`).
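For a quick smoke test, the model can be loaded with the `transformers` Auto classes. This is a minimal sketch, not from the original README; the Hub ID `cambridgeltl/BioRedditBERT-uncased` and the example sentence are assumptions for illustration:

```python
# Minimal loading sketch (assumes the Hub ID below; adjust if needed).
# Requires: pip install torch transformers
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/BioRedditBERT-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode an illustrative health-related sentence
inputs = tokenizer("I have a sore throat and a runny nose.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```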
📚 Documentation
Model description
BioRedditBERT is a BERT model initialised from BioBERT ([BioBERT-Base v1.0 + PubMed 200K + PMC 270K](https://github.com/dmis-lab/biobert)) and further pre-trained on health-related Reddit posts. Please see our paper *COMETA: A Corpus for Medical Entity Linking in the Social Media* (EMNLP 2020) for more details.
Training data
We crawled all threads from 68 health-themed subreddits, such as r/AskDocs and r/health, from the beginning of 2015 to the end of 2018, obtaining a collection of more than 800K discussions. This collection was then pruned by removing deleted posts, comments from bots or moderators, and the like. In the end, we obtained a training corpus of ca. 300 million tokens with a vocabulary size of ca. 780,000 words.
Training procedure
We use the same pre-training script as in the original [google-research/bert](https://github.com/google-research/bert) repo. The model is initialised with [BioBERT-Base v1.0 + PubMed 200K + PMC 270K](https://github.com/dmis-lab/biobert).
We train with a batch size of 64, a maximum sequence length of 64, and a learning rate of 2e-5 for 100k steps on two GeForce GTX 1080Ti (11 GB) GPUs. All other hyper-parameters are kept at their default values.
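As a concrete illustration (not from the original README), the stated hyper-parameters correspond to an invocation of `run_pretraining.py` from the google-research/bert repo along these lines; all file paths are placeholders:

```python
# Sketch of how the hyper-parameters above map onto the flags of
# run_pretraining.py in google-research/bert. Paths are placeholders,
# not from the original README.
import subprocess

subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=/path/to/reddit_tfrecords/*.tfrecord",   # placeholder
    "--output_dir=/path/to/bioredditbert_output",          # placeholder
    "--init_checkpoint=/path/to/biobert_v1.0_pubmed_pmc/biobert_model.ckpt",
    "--bert_config_file=/path/to/biobert_v1.0_pubmed_pmc/bert_config.json",
    "--do_train=True",
    "--train_batch_size=64",     # batch size of 64, as stated above
    "--max_seq_length=64",       # max sequence length of 64
    "--learning_rate=2e-5",      # learning rate of 2e-5
    "--num_train_steps=100000",  # 100k steps
], check=True)
```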
Eval results
To show the benefit from further pre - training on the social media domain, we demonstrate results on a medical entity linking dataset also in the social media: AskAPatient [(Limsopatham and Collier 2016)](https://www.aclweb.org/anthology/P16 - 1096.pdf).
We follow the same 10 - fold cross - validation procedure for all models and report the average result without fine - tuning. [CLS]
is used as representations for entity mentions (we also tried average of all tokens but found [CLS]
generally performs better).
| Model | Accuracy@1 | Accuracy@5 |
| --- | --- | --- |
| [BERT-base-uncased](https://huggingface.co/bert-base-uncased) | 38.2 | 43.3 |
| [BioBERT v1.1](https://huggingface.co/dmis-lab/biobert-v1.1) | 41.4 | 51.5 |
| ClinicalBERT | 43.9 | 54.3 |
| [BlueBERT](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12.zip) | 41.5 | 48.5 |
| SciBERT | 42.3 | 51.9 |
| [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) | 42.5 | 49.6 |
| **BioRedditBERT** | **44.3** | **56.2** |
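A minimal sketch of extracting the [CLS] mention representations described above. The Hub ID and the mention strings are assumptions for illustration, not part of the original evaluation code:

```python
# Use the [CLS] vector as the entity-mention representation,
# as described in the eval setup above. Hub ID is assumed.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/BioRedditBERT-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

mentions = ["heart attack", "myocardial infarction"]  # illustrative mentions
with torch.no_grad():
    batch = tokenizer(mentions, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, 768)
    cls_vecs = hidden[:, 0, :]                 # [CLS] is the first token

# Cosine similarity between the two mention embeddings
sim = torch.nn.functional.cosine_similarity(cls_vecs[0], cls_vecs[1], dim=0)
print(float(sim))
```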
BibTeX entry and citation info
@inproceedings{basaldella-2020-cometa,
title = "{COMETA}: A Corpus for Medical Entity Linking in the Social Media",
author = "Basaldella, Marco and Liu, Fangyu, and Shareghi, Ehsan, and Collier, Nigel",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2020",
publisher = "Association for Computational Linguistics"
}