# 🚀 MAIRA-2
MAIRA-2 is a multimodal transformer designed for generating radiology reports from chest X-rays. It can generate both grounded and non-grounded reports, facilitating research in the medical imaging field.
## 🚀 Quick Start

MAIRA-2 is a research-only model. To start using it, first set up the necessary environment and install the required packages. Here is a step-by-step guide.

### Setup

To run the sample code, you need the following packages:
- `pillow`
- `protobuf`
- `sentencepiece`
- `torch`
- `transformers`
Note: You may temporarily need to install `transformers` from source, since MAIRA-2 requires `transformers>=4.46.0.dev0`. Due to an incompatible commit on the `transformers` main branch, the current fix is to install a `transformers` version from or after commit 88d960937c81a32bfb63356a2e8ecf7999619681 but before commit 0f49deacbff3e57cde45222842c0db6375e4fa43:

```bash
pip install git+https://github.com/huggingface/transformers.git@88d960937c81a32bfb63356a2e8ecf7999619681
```
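To confirm that the pinned build is active, a quick check of the installed version may help:

```python
# Verify that the installed transformers version meets the MAIRA-2 requirement.
import transformers

print(transformers.__version__)  # Should report a 4.46.0.dev0-style source build
```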
First, initialize the model and processor, and put the model in eval mode.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)

device = torch.device("cuda")
model = model.eval()
model = model.to(device)
```
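The snippet above assumes a CUDA-capable GPU. If you are experimenting without one, a minimal fallback (slow, but functional for small tests) could look like this:

```python
# Hedged sketch: pick CUDA when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.eval().to(device)
```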
We need some data to demonstrate the forward pass. For this example, we'll use a study from the IU X-ray dataset, which has a permissive license.
```python
import requests
from PIL import Image


def get_sample_data() -> dict[str, Image.Image | str]:
    """
    Download chest X-rays from IU-Xray, which we didn't train MAIRA-2 on. License is CC.
    We modified this function from the Rad-DINO repository on Huggingface.
    """
    frontal_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png"
    lateral_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-2001.png"

    def download_and_open(url: str) -> Image.Image:
        response = requests.get(url, headers={"User-Agent": "MAIRA-2"}, stream=True)
        return Image.open(response.raw)

    frontal_image = download_and_open(frontal_image_url)
    lateral_image = download_and_open(lateral_image_url)

    sample_data = {
        "frontal": frontal_image,
        "lateral": lateral_image,
        "indication": "Dyspnea.",
        "comparison": "None.",
        "technique": "PA and lateral views of the chest.",
        "phrase": "Pleural effusion.",  # For the phrase grounding example. This patient has pleural effusion.
    }
    return sample_data


sample_data = get_sample_data()
```
## ✨ Features

- Multimodal Capability: MAIRA-2 combines an image encoder, a projection layer, and a language model to generate radiology reports from chest X-rays.
- Grounded and Non-grounded Report Generation: It can generate reports with or without spatial annotations (grounding).
- Phrase Grounding: The model can localise a given phrase in the chest X-ray image.
## 📦 Installation

To install the necessary packages for running MAIRA-2, follow these steps:

- Install the basic required packages: `pillow`, `protobuf`, `sentencepiece`, `torch`, and `transformers`.
- Install `transformers` from source due to the specific version requirement described above:

```bash
pip install git+https://github.com/huggingface/transformers.git@88d960937c81a32bfb63356a2e8ecf7999619681
```
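For reference, the basic packages from the first step can be installed with a single pip command; `transformers` is omitted here because it is installed from source in the second step:

```bash
pip install pillow protobuf sentencepiece torch
```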
## 💻 Usage Examples

### Basic Usage

Use cases 1 and 2: findings generation with or without grounding.
```python
processed_inputs = processor.format_and_preprocess_reporting_input(
    current_frontal=sample_data["frontal"],
    current_lateral=sample_data["lateral"],
    prior_frontal=None,  # Our example has no prior
    indication=sample_data["indication"],
    technique=sample_data["technique"],
    comparison=sample_data["comparison"],
    prior_report=None,  # Our example has no prior
    return_tensors="pt",
    get_grounding=False,  # For this example we generate a non-grounded report
)

processed_inputs = processed_inputs.to(device)
with torch.no_grad():
    output_decoding = model.generate(
        **processed_inputs,
        max_new_tokens=300,  # Set to 450 for grounded reporting
        use_cache=True,
    )
prompt_length = processed_inputs["input_ids"].shape[-1]
decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
decoded_text = decoded_text.lstrip()  # Findings generation completions have a single leading space
prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)
print("Parsed prediction:", prediction)
```
We get something that looks like this:

```text
There is a large right pleural effusion with associated right basilar atelectasis. The left lung is clear. No pneumothorax is identified. The cardiomediastinal silhouette and hilar contours are normal. There is no free air under the diaphragm. Surgical clips are noted in the right upper quadrant of the abdomen.
```
If we had set `get_grounding=True`, MAIRA-2 would generate a grounded report. For this example, that looks like this:

```python
('There is a large right pleural effusion.', [(0.055, 0.275, 0.445, 0.665)]),
('The left lung is clear.', None),
('No pneumothorax is identified.', None),
('The cardiomediastinal silhouette is within normal limits.', None),
('The visualized osseous structures are unremarkable.', None)
```
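These box coordinates are relative to the processed (cropped) image rather than the original file, so see the usage tip at the end of this card before overlaying them. As a rough illustration only, here is a minimal sketch of drawing one box with PIL, assuming the coordinates have already been adjusted to the original image and are normalised to [0, 1]:

```python
from PIL import Image, ImageDraw

def draw_box(image: Image.Image, box: tuple[float, float, float, float]) -> Image.Image:
    # box is (x_min, y_min, x_max, y_max) in normalised [0, 1] coordinates.
    width, height = image.size
    pixel_box = (box[0] * width, box[1] * height, box[2] * width, box[3] * height)
    annotated = image.convert("RGB")  # convert() returns a copy we can draw on
    ImageDraw.Draw(annotated).rectangle(pixel_box, outline="red", width=3)
    return annotated

# For example, overlay the pleural effusion box from the grounded report above.
annotated = draw_box(sample_data["frontal"], (0.055, 0.275, 0.445, 0.665))
```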
### Advanced Usage

Use case 3: phrase grounding.
```python
processed_inputs = processor.format_and_preprocess_phrase_grounding_input(
    frontal_image=sample_data["frontal"],
    phrase=sample_data["phrase"],
    return_tensors="pt",
)

processed_inputs = processed_inputs.to(device)
with torch.no_grad():
    output_decoding = model.generate(
        **processed_inputs,
        max_new_tokens=150,
        use_cache=True,
    )
prompt_length = processed_inputs["input_ids"].shape[-1]
decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)
print("Parsed prediction:", prediction)
```
This gives us something like this:

```python
('Pleural effusion.', [(0.025, 0.345, 0.425, 0.575)])
```
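The parsed output pairs the echoed phrase with its predicted box(es). A small sketch of unpacking it (hedged: depending on the processor version, the result may be a single tuple or a one-element list):

```python
# Unpack the phrase-grounding prediction: the echoed phrase plus its box(es).
phrase, boxes = prediction if isinstance(prediction, tuple) else prediction[0]
print("Phrase:", phrase)
for x_min, y_min, x_max, y_max in boxes or []:
    print(f"Box (normalised): ({x_min:.3f}, {y_min:.3f}, {x_max:.3f}, {y_max:.3f})")
```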
## 📚 Documentation

### Model Details

MAIRA-2 is composed of the image encoder RAD-DINO-MAIRA-2 (used frozen), a projection layer (trained from scratch), and the language model vicuna-7b-v1.5 (fully fine-tuned).
| Property | Details |
|---|---|
| Model Type | Multimodal transformer |
| Developed by | Microsoft Research Health Futures |
| Language(s) (NLP) | English |
| License | MSRLA |
| Finetuned from model | vicuna-7b-v1.5, RAD-DINO-MAIRA-2 |
### Uses

MAIRA-2 is shared for research purposes only and is not meant to be used in clinical practice.
#### Direct Use

As inputs, MAIRA-2 takes a frontal chest X-ray and any of the following:

- A lateral view from the current study
- A frontal view from the prior study, with the accompanying prior report
- The indication for the current study
- The technique and comparison sections for the current study
MAIRA-2 can generate the findings section of the current study in one of two forms:

- Narrative text, without any image annotations (this is the typical report generation scenario).
- A grounded report, wherein all described findings are accompanied by zero or more bounding boxes indicating their location on the current frontal image.

MAIRA-2 can also perform phrase grounding. In this case, it must also be provided with an input phrase. It will then repeat the phrase and generate a bounding box localising the finding described in the phrase.
#### Out-of-Scope Use

MAIRA-2 was trained on chest X-rays from adults with English-language reports only, and is not expected to work on any other imaging modality or anatomy. Variations in the input prompt (e.g., changing the instruction) are likely to degrade performance, as this model was not optimised for arbitrary user inputs.

As above, this is a research model and should not be used in any real clinical or production scenario.
### Bias, Risks, and Limitations

#### Data biases

MAIRA-2 was trained on chest X-ray report datasets from Spain (translated from the original Spanish to English) and the USA. Reporting styles, patient demographics, disease prevalence, and image acquisition protocols can vary across health systems and regions. These factors will impact the generalisability of the model.

#### Model errors (fabrication, omission)

This model does not perform perfectly on its tasks, as outlined in more detail in the MAIRA-2 report. Hence, errors can be present in the generated (grounded) reports.
### Training details

We did not originally train MAIRA-2 using the exact model class provided here; however, we have checked that its behaviour is the same. We provide this class to facilitate research re-use and inference.
#### Training data

MAIRA-2 was trained on a mix of public and private chest X-ray datasets. Each example comprises one or more CXR images and the associated report text, with or without grounding (spatial annotations). The model is trained to generate the findings section of the report, with or without grounding.

| Dataset | Country | # examples (ungrounded) | # examples (grounded) |
|---|---|---|---|
| MIMIC-CXR | USA | 55 218 | 595* |
| PadChest | Spain | 52 828 | 3 122 |
| USMix (Private) | USA | 118 031 | 53 613 |

*We use the [MS-CXR](https://physionet.org/content/ms-cxr/) phrase grounding dataset to provide grounding examples from MIMIC-CXR.
### Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

| Property | Details |
|---|---|
| Hardware Type | NVIDIA A100 GPUs |
| Hours used | 1432 |
| Cloud Provider | Azure |
| Compute Region | West US 2 |
| Carbon Emitted | 107.4 kg CO₂ eq. (ostensibly offset by this provider) |
## 🔧 Technical Details

MAIRA-2's architecture combines a frozen image encoder, a newly trained projection layer, and a fully fine-tuned language model. The image encoder extracts features from the chest X-rays, the projection layer maps these features into the language model's input space, and the language model generates the radiology report conditioned on the projected image features and the text prompt.
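As a conceptual illustration only (not the actual MAIRA-2 implementation; all class and argument names below are hypothetical), the data flow can be sketched as:

```python
import torch
import torch.nn as nn

class MultimodalReportGenerator(nn.Module):
    """Hypothetical sketch of a MAIRA-2-style data flow, not the real code."""

    def __init__(self, image_encoder: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # frozen, e.g. RAD-DINO-MAIRA-2
        self.projector = projector            # trained from scratch
        self.language_model = language_model  # fully fine-tuned, e.g. vicuna-7b-v1.5

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # the image encoder stays frozen
            image_features = self.image_encoder(pixel_values)
        image_tokens = self.projector(image_features)  # map into the LM embedding space
        # Prepend the projected image tokens to the text embeddings, then decode.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds)
```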
## 📄 License
The model is licensed under MSRLA.
## 📖 Citation

BibTeX:

```bibtex
@article{Bannur2024MAIRA2GR,
  title={MAIRA-2: Grounded Radiology Report Generation},
  author={Shruthi Bannur and Kenza Bouzid and Daniel C. Castro and Anton Schwaighofer and Anja Thieme and Sam Bond-Taylor and Maximilian Ilse and Fernando P\'{e}rez-Garc\'{i}a and Valentina Salvatelli and Harshita Sharma and Felix Meissen and Mercy Prasanna Ranjit and Shaury Srivastav and Julia Gong and Noel C. F. Codella and Fabian Falck and Ozan Oktay and Matthew P. Lungren and Maria T. A. Wetscherek and Javier Alvarez-Valle and Stephanie L. Hyland},
  journal={arXiv},
  year={2024},
  volume={abs/2406.04449},
  url={https://arxiv.org/abs/2406.04449}
}
```
APA:

Bannur*, S., Bouzid*, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., Meissen, F., Ranjit, M. P., Srivastav, S., Gong, J., Codella, N. C. F., Falck, F., Oktay, O., Lungren, M. P., Wetscherek, M. T., Alvarez-Valle, J., & Hyland, S. L. (2024). MAIRA-2: Grounded Radiology Report Generation. arXiv preprint abs/2406.04449.
## 📞 Model Card Contact

- Stephanie Hyland (stephanie.hyland@microsoft.com)
- Shruthi Bannur (shruthi.bannur@microsoft.com)
## ⚠️ Important Note

MAIRA-2 is a research-only model and is not intended for clinical use. It has not been extensively tested for accuracy, reliability, fairness, or security.
## 💡 Usage Tip

When using MAIRA-2 for grounded reporting, remember that the generated bounding-box coordinates are relative to the cropped image. Use `processor.adjust_box_for_original_image_size` to get boxes adjusted for the original image shape.
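A minimal sketch of that adjustment follows; the keyword argument names here are assumptions (the exact signature is not documented on this card), so check the processor's docstring before relying on them:

```python
# Hypothetical usage; verify the argument names against the processor docstring.
original_width, original_height = sample_data["frontal"].size
adjusted_box = processor.adjust_box_for_original_image_size(
    box=(0.055, 0.275, 0.445, 0.665),  # a box from the grounded-report example
    width=original_width,
    height=original_height,
)
```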