The open-source model llava-med-7b-delta - Free deployment to boost biomedical image and text processing

Llava Med 7b Delta

Developed by microsoft

LLaVA-Med is a biomedical multimodal model constructed through visual instruction fine-tuning, capable of processing biomedical images and text.

Text-to-Image

Transformers

Open Source License:Other #Biomedical Visual Question Answering #Multimodal Instruction Fine-tuning #Curriculum Learning Training

Downloads 257

Release Time : 11/9/2023

Model Overview

LLaVA-Med is a biomedical vision-language model initialized from LLaVA, fine-tuned on biomedical data through curriculum learning, focusing on biomedical visual question answering and dialogue tasks.

Model Features

Biomedical Domain Adaptation

Optimized specifically for the biomedical domain through curriculum learning

Multimodal Capability

Simultaneously processes biomedical images and related textual information

Research Only

Focused on biomedical research applications, not suitable for clinical decision-making

Model Capabilities

Biomedical Image Understanding

Biomedical Text Understanding

Visual Question Answering

Multimodal Dialogue

Use Cases

Medical Research

Biomedical Literature Analysis

Analyzing charts and textual content in medical literature

Performs excellently on benchmarks like PathVQA and VQA-RAD

Medical Education

Assisting in understanding visual content for medical education

🚀 LLaVA-Med: Large Language and Vision Assistant for BioMedicine

Visual instruction tuning towards building large language and vision models with GPT-4 level capabilities in the biomedicine space.

[Paper, NeurIPS 2023 Datasets and Benchmarks Track (Spotlight)] | [LLaVA-Med Github Repository]

Chunyuan Li*, Cliff Wong*, Sheng Zhang*, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao (*Equal Contribution)

Generated by GLIGEN using the grounded inpainting mode, with three boxes: white doctor coat, stethoscope, white doctor hat with a red cross sign.

LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment then full-blown instruction-tuning). We evaluated LLaVA-Med on standard visual conversation and question answering tasks.

⚠️ Important Note

This "delta model" cannot be used directly. Users have to apply it on top of the original LLaMA weights to get actual LLaVA weights.

💡 Usage Tip

The data, code, and model checkpoints are intended and licensed for research use only. They are also subject to additional restrictions dictated by the Terms of Use: LLaMA, Vicuna and GPT-4 respectively. The data is made available under CC BY NC 4.0. The data, code, and model checkpoints may be used for non-commercial purposes and any models trained using the dataset should be used only for research purposes. It is expressly prohibited for models trained on this data to be used in clinical care or for any clinical decision making purposes.

✨ Features

LLaVA-Med is a large language and vision model trained using a curriculum learning method for adapting LLaVA to the biomedical domain.
It is an open-source release intended for research use only to facilitate reproducibility of the corresponding paper, which claims improved performance for open-ended biomedical questions answering tasks, including common visual question answering (VQA) benchmark datasets such as PathVQA and VQA-RAD.

📦 Installation

Clone the LLaVA-Med Github repository and navigate to LLaVA-Med folder

https://github.com/microsoft/LLaVA-Med.git
cd LLaVA-Med

Install Package: Create conda environment

conda create -n llava-med python=3.10 -y
conda activate llava-med
pip install --upgrade pip  # enable PEP 660 support

Install additional packages for training cases

pip uninstall torch torchvision -y
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install openai==0.27.8
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers@cae78c46
pip install -e .

pip install einops ninja open-clip-torch
pip install flash-attn --no-build-isolation

💻 Usage Examples

Basic Usage

To get LLaVA-Med weights by applying the delta:

python3 -m llava.model.apply_delta \
    --base /path/to/llama-7b \
    --target /output/path/to/llava_med_in_text_60k \
    --delta path/to/llava_med_in_text_60k_delta

Advanced Usage

Medical Visual Chat (GPT-assisted Evaluation)

Generate LLaVA-Med responses

python model_vqa.py \
    --model-name ./checkpoints/LLaVA-7B-v0 \
    --question-file data/eval/llava_med_eval_qa50_qa.jsonl \
    --image-folder data/images/ \
    --answers-file /path/to/answer-file.jsonl

Evaluate the generated responses

python llava/eval/eval_multimodal_chat_gpt_score.py \
    --question_input_path data/eval/llava_med_eval_qa50_qa.jsonl \
    --input_path /path/to/answer-file.jsonl \
    --output_path /path/to/save/gpt4-eval-for-individual-answers.jsonl

Summarize the evaluation results

python summarize_gpt_review.py

Medical VQA

- Prepare Data

Please see VQA-Rad repo for setting up the dataset.
Generate VQA-Rad dataset for LLaVA-Med conversation-style format (the same format with instruct tuning). For each dataset, we process it into three components: train.json, test.json, images.

- Fine-tuning

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path /path/to/checkpoint_llava_med_instruct_60k_inline_mention \
    --data_path /path/to/eval/vqa_rad/train.json \
    --image_folder /path/to/eval/vqa_rad/images \
    --vision_tower openai/clip-vit-large-patch14 \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

- Evaluation

(a) Generate LLaVA responses on ScienceQA dataset (a.1). [Option 1] Multiple-GPU inference

python llava/eval/run_med_datasets_eval_batch.py --num-chunks 8  --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
    --question-file path/to/eval/vqa_rad/test.json \
    --image-folder path/to/eval/vqa_rad/images \
    --answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl

(a.2). [Option 2] Single-GPU inference

python llava/eval/model_vqa_med.py --model-name /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad \
    --question-file path/to/eval/vqa_rad/test.json \
    --image-folder path/to/eval/vqa_rad/images \
    --answers-file /path/to/checkpoint_llava_med_instruct_60k_inline_mention/eval/fine_tuned/vqa_rad/test-answer-file.jsonl

(b) Evaluate the generated responses (b.1). [Option 1] Evaluation for all three VQA datasets

python llava/eval/run_eval_batch.py \
    --pred_file_parent_path /path/to/llava-med \
    --target_test_type test-answer-file

📚 Documentation

Model Uses

Intended Use

The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision making purposes.

Primary Intended Use

The primary intended use is to support AI researchers reproducing and building on top of this work. LLaVA-Med and its associated models should be helpful for exploring various biomedical vision-language processing (VLP ) and vision question answering (VQA) research questions.

Out-of-Scope Use

Any deployed use case of the model --- commercial or otherwise --- is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended for research use only and not intended for deployed use cases. Please refer to the associated paper for more details.

Data

This model builds upon PMC-15M dataset, which is a large-scale parallel image-text dataset for biomedical vision-language processing. It contains 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. It covers a diverse range of biomedical image types, such as microscopy, radiography, histology, and more.

Limitations

This model was developed using English corpora, and thus may be considered English-only. This model is evaluated on a narrow set of biomedical benchmark tasks, described in LLaVA-Med paper. As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. In particular, this model is likely to carry many of the limitations of the model from which it is derived, LLaVA.

Further, this model was developed in part using the PMC-15M dataset. The figure-caption pairs that make up this dataset may contain biases reflecting the current practice of academic publication. For example, the corresponding papers may be enriched for positive findings, contain examples of extreme cases, and otherwise reflect distributions that are not representative of other sources of biomedical data.

📄 License

Code License: Microsoft Research License
Data License: CC By NC 4.0

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご