🚀 CXR-BERT-specialized
CXR-BERT-specialized is a chest X-ray domain-specific language model. It enhances performance in radiology natural language inference, masked language model token prediction, and downstream vision-language processing tasks.
🚀 Quick Start
CXR-BERT is a chest X-ray (CXR) domain-specific language model. It makes use of an improved vocabulary, a novel pretraining procedure, weight regularization, and text augmentations. The resulting model shows improved performance on radiology natural language inference, radiology masked language model token prediction, and downstream vision-language processing tasks such as zero-shot phrase grounding and image classification.
First, we pretrain CXR-BERT-general from a randomly initialized BERT model via Masked Language Modeling (MLM) on abstracts from PubMed and clinical notes from the publicly-available MIMIC-III and MIMIC-CXR. The general model is expected to be applicable for research in clinical domains other than chest radiology through domain-specific fine-tuning.
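As a quick illustration of the MLM pretraining objective, the general variant can be queried for masked-token predictions. Below is a minimal sketch using the standard transformers masked-LM interface; the report sentence is made up for illustration, and loading details may differ slightly across transformers versions:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Sketch: masked-token prediction with the MLM-pretrained general variant.
url = "microsoft/BiomedVLP-CXR-BERT-general"
tokenizer = AutoTokenizer.from_pretrained(url, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(url, trust_remote_code=True)

# Made-up report sentence with one token masked out.
text = f"There is no evidence of pleural {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the most likely token at the [MASK] position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```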
CXR-BERT-specialized is continually pretrained from CXR-BERT-general to further specialize in the chest X-ray domain. In the final stage, CXR-BERT is trained in a multi-modal contrastive learning setup, similar to the CLIP framework. The latent representation of the [CLS] token is used to align text and image embeddings.
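Concretely, the alignment stage can be pictured as a CLIP-style symmetric contrastive (InfoNCE) objective over batches of paired text and image embeddings. The following is a minimal sketch, not the actual training code; the batch size, embedding dimension, and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    A minimal sketch of the contrastive alignment stage; the actual
    training procedure is described in the ECCV'22 paper.
    """
    # L2-normalize so dot products are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    logits = text_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pairs on the diagonal

    # Average the text-to-image and image-to-text cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random stand-in embeddings (batch of 4, 128-d joint space).
loss = clip_style_loss(torch.randn(4, 128), torch.randn(4, 128))
```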
✨ Features
Model variations
| Property | Details |
| --- | --- |
| Model Type | There are two variations: CXR-BERT-general and CXR-BERT-specialized (after multi-modal training). |
| Model identifier on HuggingFace | CXR-BERT-general: microsoft/BiomedVLP-CXR-BERT-general; CXR-BERT-specialized: microsoft/BiomedVLP-CXR-BERT-specialized |
| Vocabulary | Both variations use a vocabulary built from PubMed & MIMIC. |
| Note | CXR-BERT-general is pretrained for biomedical literature and clinical domains; CXR-BERT-specialized is pretrained for the chest X-ray domain. |
Image model
CXR-BERT-specialized is jointly trained with a ResNet-50 image model in a multi-modal contrastive learning framework. Before multi-modal learning, the image model is pretrained on the same set of images in MIMIC-CXR using SimCLR. The corresponding model definition and its loading functions can be accessed through our HI-ML-Multimodal GitHub repository. The joint image and text model, namely BioViL, can be used in phrase grounding applications as shown in this Python notebook example. Additionally, please check the MS-CXR benchmark for a more systematic evaluation of joint image and text models on phrase grounding tasks.
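Conceptually, phrase grounding in this joint space reduces to comparing the projected text embedding against patch-level projected image features. The sketch below uses random tensors as stand-ins for the BioViL image encoder's output (the real loading and inference code lives in the HI-ML-Multimodal repository); the grid and embedding sizes are illustrative:

```python
import torch
import torch.nn.functional as F

# Schematic sketch of zero-shot phrase grounding in the joint space.
# `patch_embeddings` stands in for projected patch features from the
# BioViL image encoder (an H x W grid of D-dimensional vectors).
H, W, D = 15, 15, 128
patch_embeddings = torch.randn(H, W, D)   # hypothetical image features
text_embedding = torch.randn(D)           # hypothetical projected phrase embedding

# Cosine similarity between the phrase and every image patch yields a
# heatmap that localizes the phrase in the image.
patches = F.normalize(patch_embeddings.reshape(-1, D), dim=-1)
phrase = F.normalize(text_embedding, dim=0)
heatmap = (patches @ phrase).reshape(H, W)
```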
💻 Usage Examples
Basic Usage
Here is how to use this model to extract radiological sentence embeddings and obtain their cosine similarity in the joint space (image and text):
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (trust_remote_code is required because the
# checkpoint ships custom modelling code on the Hugging Face Hub).
url = "microsoft/BiomedVLP-CXR-BERT-specialized"
tokenizer = AutoTokenizer.from_pretrained(url, trust_remote_code=True)
model = AutoModel.from_pretrained(url, trust_remote_code=True)

# Input text prompts: two paraphrases and one sentence with a different meaning.
text_prompts = ["There is no pneumothorax or pleural effusion",
                "No pleural effusion or pneumothorax is seen",
                "The extent of the pleural effusion is constant."]

# Tokenize and compute the sentence embeddings in the joint latent space.
tokenizer_output = tokenizer.batch_encode_plus(batch_text_or_text_pairs=text_prompts,
                                               add_special_tokens=True,
                                               padding='longest',
                                               return_tensors='pt')
embeddings = model.get_projected_text_embeddings(input_ids=tokenizer_output.input_ids,
                                                 attention_mask=tokenizer_output.attention_mask)

# Compute the cosine similarity of the sentence embeddings.
sim = torch.mm(embeddings, embeddings.t())
```
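Since the first two prompts are paraphrases of the same finding while the third describes a different one, their pairwise entry in `sim` should be markedly higher than either prompt's similarity to the third sentence.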
📚 Documentation
Model Use
Intended Use
This model is intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper.
Primary Intended Use
The primary intended use is to support AI researchers building on top of this work. CXR-BERT and its associated models should be helpful for exploring various clinical NLP & VLP research questions, especially in the radiology domain.
Out-of-Scope Use
Any deployed use case of the model, commercial or otherwise, is currently out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are not intended for deployed use cases. Please refer to the associated paper for more details.
Data
This model builds upon existing publicly-available datasets:

- PubMed (biomedical abstracts)
- MIMIC-III (clinical notes)
- MIMIC-CXR (chest X-ray reports and images)

These datasets reflect a broad variety of sources, ranging from biomedical abstracts to intensive care unit notes to chest X-ray radiology notes. The radiology notes are accompanied by their associated chest X-ray DICOM images in the MIMIC-CXR dataset.
Performance
We demonstrate that this language model achieves state-of-the-art results in radiology natural language inference through its improved vocabulary and a novel language pretraining objective that leverages semantics and discourse characteristics of radiology reports.
A comparison with other common models, including ClinicalBERT and PubMedBERT, is highlighted below:

| Model | RadNLI accuracy (MedNLI transfer) | Mask prediction accuracy | Avg. # tokens after tokenization | Vocabulary size |
| --- | --- | --- | --- | --- |
| RadNLI baseline | 53.30 | - | - | - |
| ClinicalBERT | 47.67 | 39.84 | 78.98 (+38.15%) | 28,996 |
| PubMedBERT | 57.71 | 35.24 | 63.55 (+11.16%) | 28,895 |
| CXR-BERT (after Phase-III) | 60.46 | 77.72 | 58.07 (+1.59%) | 30,522 |
| CXR-BERT (after Phase-III + Joint Training) | 65.21 | 81.58 | 58.07 (+1.59%) | 30,522 |
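The "Avg. # tokens after tokenization" column reflects how compactly each vocabulary encodes radiology text. A quick, informal way to inspect this for any sentence is to count the tokens each tokenizer produces; a minimal sketch follows (the example sentence is made up, and emilyalsentzer/Bio_ClinicalBERT is used here as a commonly-available ClinicalBERT checkpoint):

```python
from transformers import AutoTokenizer

# Illustrative sketch: compare how many tokens two vocabularies need for
# the same radiology sentence. Fewer tokens usually indicates better
# coverage of domain terms. The example sentence is made up.
sentence = "Bibasilar atelectasis without pneumothorax."

for name in ["microsoft/BiomedVLP-CXR-BERT-specialized",
             "emilyalsentzer/Bio_ClinicalBERT"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    print(f"{name}: {len(tok.tokenize(sentence))} tokens")
```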
CXR-BERT also contributes to better vision-language representation learning through its improved text encoding capability. Below is the zero-shot phrase grounding performance on the MS-CXR dataset, which evaluates the quality of image-text latent representations.
| Vision–Language Pretraining Method | Text Encoder | MS-CXR Phrase Grounding (Avg. CNR Score) |
| --- | --- | --- |
| Baseline | ClinicalBERT | 0.769 |
| Baseline | PubMedBERT | 0.773 |
| ConVIRT | ClinicalBERT | 0.818 |
| GLoRIA | ClinicalBERT | 0.930 |
| BioViL | CXR-BERT | 1.027 |
| BioViL-L | CXR-BERT | 1.142 |
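The CNR (contrast-to-noise ratio) metric summarizes how sharply a similarity heatmap separates the annotated region from the rest of the image. A minimal sketch of the computation, assuming the paper's definition (absolute difference of interior/exterior means, normalized by the root of the summed variances):

```python
import torch

def cnr(heatmap: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Contrast-to-noise ratio of a similarity heatmap w.r.t. a region mask.

    Sketch assuming the paper's definition: |mean(inside) - mean(outside)|
    divided by sqrt(var(inside) + var(outside)).
    """
    inside, outside = heatmap[mask], heatmap[~mask]
    return (inside.mean() - outside.mean()).abs() / torch.sqrt(
        inside.var() + outside.var())

# Toy example: a random heatmap and a box-shaped annotation mask.
heatmap = torch.randn(15, 15)
mask = torch.zeros(15, 15, dtype=torch.bool)
mask[4:9, 6:12] = True
print(cnr(heatmap, mask))
```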
Limitations
This model was developed using English corpora, and thus can be considered English-only.
Further information
Please refer to the corresponding paper, "Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing", ECCV'22, for additional details on model training and evaluation.
For additional inference pipelines with CXR-BERT, please refer to the HI-ML-Multimodal GitHub repository.
📄 License
This model is released under the MIT license.
📖 Citation
The corresponding manuscript was accepted for presentation at the European Conference on Computer Vision (ECCV) 2022:

```bibtex
@misc{https://doi.org/10.48550/arxiv.2204.09817,
  doi = {10.48550/ARXIV.2204.09817},
  url = {https://arxiv.org/abs/2204.09817},
  author = {Boecking, Benedikt and Usuyama, Naoto and Bannur, Shruthi and Castro, Daniel C. and Schwaighofer, Anton and Hyland, Stephanie and Wetscherek, Maria and Naumann, Tristan and Nori, Aditya and Alvarez-Valle, Javier and Poon, Hoifung and Oktay, Ozan},
  title = {Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing},
  publisher = {arXiv},
  year = {2022},
}
```