BioBERT Open-Source Language Model: Free Support for Biomedical Text Q&A Applications

BioBERT Large Cased v1.1 SQuAD

Developed by dmis-lab

BioBERT is a BERT-based pretrained language model specifically optimized for biomedical text mining tasks, suitable for question answering systems.

Question Answering System #Biomedical QA #PubMed Pretraining #Clinical Text Understanding

Downloads 1,227

Release Time: 3/2/2022

Model Overview

This domain-specific model was obtained by further pretraining BERT on PubMed and PMC biomedical literature, and it is particularly suitable for biomedical QA tasks.

Model Features

Biomedical Domain Optimization

Specially trained on PubMed and PMC biomedical literature corpus, with enhanced understanding of medical terminology and biomedical texts

QA System Specialization

Optimized specifically for QA tasks, capable of accurately understanding questions and extracting relevant answers from texts

Large-scale Pretraining

Additional 470K steps of biomedical domain pretraining on top of Wikipedia and BooksCorpus

Model Capabilities

Biomedical Text Understanding

Question Answering

Text Mining

Use Cases

Healthcare

Medical Literature QA

Extracting answers to specific questions from medical research papers

Clinical Decision Support

Assisting medical professionals in quickly accessing relevant medical knowledge

Academic Research

Biomedical Literature Mining

Extracting key information from large volumes of biomedical literature

🚀 BioBERT Large Cased v1.1 SQuAD Model Card

BioBERT Large Cased v1.1 SQuAD is a pre-trained biomedical language representation model designed for question-answering tasks, offering high-performance solutions in the biomedical text mining field.

🚀 Quick Start

Use the code below to get started with the model.


from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-large-cased-v1.1-squad")

model = AutoModelForQuestionAnswering.from_pretrained("dmis-lab/biobert-large-cased-v1.1-squad")
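Loading the tokenizer and model is only the first step. The sketch below shows one way an answer span might be extracted once they are loaded. It assumes PyTorch is installed alongside transformers; the helper names `best_span` and `answer_question` and the `max_answer_len` default are hypothetical and not part of the original card. The span search is deliberately simple and does not prevent the span from landing in the question tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


def answer_question(question, context,
                    model_name="dmis-lab/biobert-large-cased-v1.1-squad"):
    """Hypothetical helper: run the model once and decode the best-scoring span."""
    # Imported here so best_span can be used without these packages installed.
    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Encode the question and context as one pair, truncated to 512 tokens.
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    start, end = best_span(outputs.start_logits[0].tolist(),
                           outputs.end_logits[0].tolist())
    answer_ids = inputs["input_ids"][0][start:end + 1]
    return tokenizer.decode(answer_ids, skip_special_tokens=True)
```

Calling `answer_question("What does aspirin inhibit?", "Aspirin irreversibly inhibits cyclooxygenase.")` (an invented example) would download the model on first use and return the decoded span.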

✨ Features

Question-Answering Task: This model can be used for the task of question answering.

📚 Documentation

Model Details

Property	Details
Model Type	Question Answering
Developed by	DMIS-lab (Data Mining and Information Systems Lab, Korea University)
Shared by	DMIS-lab (Data Mining and Information Systems Lab, Korea University)
Parent Model	[biobert-large-cased-v1.1](https://huggingface.co/dmis-lab/biobert-large-cased-v1.1)
Resources for more information	GitHub Repo, Associated Paper

Uses

Direct Use

This model can be used for the task of question answering.
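BERT-style models accept at most 512 tokens per input, so passages longer than that are commonly split into overlapping windows, each of which is then paired with the question. The following is a minimal sketch of such a splitter over an already tokenized passage. The function name `sliding_windows` and the default `window` and `stride` values are assumptions chosen for illustration, not settings taken from the BioBERT authors.

```python
def sliding_windows(tokens, window=384, stride=128):
    """Split a token list into overlapping windows.

    Each window holds at most `window` tokens, and consecutive windows
    share `stride` tokens so an answer near a boundary is not cut in half.
    """
    if window <= stride:
        raise ValueError("window must be larger than stride")
    step = window - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the passage.
        if start + window >= len(tokens):
            break
        start += step
    return windows


# Toy example with integers standing in for tokens.
chunks = sliding_windows(list(range(10)), window=4, stride=2)
print(len(chunks))   # 4
print(chunks[-1])    # [6, 7, 8, 9]
```

Each window can then be scored separately, keeping the answer from the window with the highest span score.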

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

Training Details

Training Data

The model creators note in the associated paper:

We used the BERTBASE model pre-trained on English Wikipedia and BooksCorpus for 1M steps. BioBERT v1.0 (+ PubMed + PMC) is the version of BioBERT (+ PubMed + PMC) trained for 470K steps. When using both the PubMed and PMC corpora, we found that 200K and 270K pre-training steps were optimal for PubMed and PMC, respectively. We also used the ablated versions of BioBERT v1.0, which were pre-trained on only PubMed for 200K steps (BioBERT v1.0 (+ PubMed)) and PMC for 270K steps (BioBERT v1.0 (+ PMC)).

Training Procedure

Preprocessing

The model creators note in the associated paper:

We pre-trained BioBERT using Naver Smart Machine Learning (NSML) (Sung et al., 2017), which is utilized for large-scale experiments that need to be run on several GPUs.

Speeds, Sizes, Times

The model creators note in the associated paper:

The maximum sequence length was fixed to 512 and the mini-batch size was set to 192, resulting in 98,304 words per iteration.
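The words-per-iteration figure follows directly from multiplying the two settings quoted above. A short check of that arithmetic, using illustrative variable names that do not come from the paper:

```python
# Settings quoted from the associated paper.
max_sequence_length = 512
mini_batch_size = 192

# Each iteration processes one full mini-batch of maximum-length sequences.
words_per_iteration = max_sequence_length * mini_batch_size
print(words_per_iteration)  # 98304
```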

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type:
- Training: Eight NVIDIA V100 (32GB) GPUs
- Fine-tuning: A single NVIDIA Titan Xp (12GB) GPU, used to fine-tune BioBERT on each task

Citation

BibTeX:

@article{lee2019biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={arXiv preprint arXiv:1901.08746},
  year={2019}
}

More Information

For help or issues using BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee (lee.jnhk (at) gmail.com) or Wonjin Yoon (wonjin.info (at) gmail.com) for communication related to BioBERT.

Model Card Authors

DMIS-lab (Data Mining and Information Systems Lab, Korea University) in collaboration with Ezi Ozoani and the Hugging Face team
