Aloe: A Family of Fine-tuned Open Healthcare LLMs
Aloe is a family of fine-tuned open healthcare LLMs. These models achieve state-of-the-art performance on several medical tasks. They are available in multiple sizes and trained on diverse medical data, making them robust and versatile for healthcare applications.
Quick Start
The examples below show two ways to run Aloe: through the Transformers pipeline and through the AutoModelForCausalLM class.
Usage Examples
Basic Usage (Transformers pipeline)
```python
import transformers
import torch

model_id = "HPAI-BSC/Qwen2.5-Aloe-Beta-7B"

# Load the model in bfloat16 and let Transformers place it on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert medical assistant named Aloe, developed by the High Performance Artificial Intelligence Group at Barcelona Supercomputing Center(BSC). You are to be a helpful, respectful, and honest assistant."},
    {"role": "user", "content": "Hello."},
]

# Render the conversation into a single prompt string using the model's chat template.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop generation at either the EOS token or the end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|im_end|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.05
)

# The pipeline returns prompt + completion; print only the completion.
print(outputs[0]["generated_text"][len(prompt):])
```
Advanced Usage (Transformers AutoModelForCausalLM)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HPAI-BSC/Qwen2.5-Aloe-Beta-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert medical assistant named Aloe, developed by the High Performance Artificial Intelligence Group at Barcelona Supercomputing Center(BSC). You are to be a helpful, respectful, and honest assistant."},
    {"role": "user", "content": "Hello"},
]

# Apply the chat template and tokenize in one step.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generation at either the EOS token or the end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|im_end|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.05
)

# Drop the prompt tokens and decode only the newly generated ones.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
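Both examples add `<|im_end|>` to the terminators because that token closes each turn in the prompt. The sketch below approximates the rendered prompt in plain Python, assuming the Aloe tokenizer keeps the ChatML layout of the Qwen2.5 base models. `render_chatml` is a hypothetical helper written for illustration; the template shipped with the tokenizer remains authoritative.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML prompt (assumed layout, not the tokenizer's real template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are an expert medical assistant named Aloe."},
    {"role": "user", "content": "Hello."},
]
prompt = render_chatml(messages)
print(prompt)
```

Because the model is expected to close its own turn with `<|im_end|>`, listing that token in `eos_token_id` stops generation at the end of the assistant reply rather than running to `max_new_tokens`.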
Features
- Multiple Sizes: Aloe is available in four model sizes: 7B, 8B, 70B, and 72B.
- Diverse Training: Trained on 20 medical tasks, resulting in a robust and versatile healthcare model.
- High Performance: Evaluations show Aloe models to be among the best in their class. When combined with a RAG system, the 7B and 8B versions get close to the performance of closed models, and the 70B and 72B versions outperform them.
Installation
Aloe needs no dedicated installation. The usage examples rely on the transformers and torch libraries, and `device_map="auto"` additionally requires accelerate:
```shell
pip install transformers torch accelerate
```
Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Developed by | HPAI |
| Model Type | Causal decoder-only transformer language model |
| Language(s) (NLP) | English (capable but not formally evaluated on other languages) |
| License | This model is based on Qwen2.5-7B which is released with Apache 2.0 license. All modifications are available with a CC BY 4.0 license, making the Aloe Beta models compatible with commercial use. |
| Base model | Qwen2.5-7B |
| Paper | (more coming soon) |
| RAG Repository | https://github.com/HPAI-BSC/prompt_engine |
Model Performance
Aloe Beta has been tested on the most popular healthcare QA datasets, with and without the Medprompt inference technique. Results show competitive performance, reaching SOTA among models of the same size. It has also been evaluated on many other medical tasks and, for the general domain, on the OpenLLM Leaderboard benchmark, where it shows good results.
Uses
Direct Use
We encourage the use of Aloe for research purposes, as a stepping stone to build better foundational models for healthcare. In production, Aloe should always be used under the supervision of a human expert.
Out-of-Scope Use
These models are not to be used for clinical practice, medical diagnosis, or any other form of direct or indirect healthcare advice. Models are prone to error and can produce toxic content. The use of Aloe models for activities harmful to individuals, such as spam, fraud, or impersonation, is strictly prohibited. Minors should not be left alone to interact with Aloe without supervision.
Bias, Risks, and Limitations
Aloe can produce toxic content when prompted to do so and carries multiple undesirable biases. Although efforts have been made to mitigate this, model safety cannot be fully guaranteed. We avoid using any personal data in training.
We identify at least three risk cases specific to healthcare LLMs:
- Healthcare professional impersonation: Aloe could be used to increase the efficacy of such deceptive activities. Preventive actions include public literacy and legislation.
- Medical decision-making without professional supervision: Aloe can facilitate self-delusion. Public literacy on the dangers of self-diagnosis and disclaimers are important defenses.
- Access to information on dangerous substances or procedures: LLMs can centralize access to sensitive information. Model alignment helps but is insufficient due to jailbreaking methods.
Training Details
Supervised fine-tuning
SFT on top of Qwen2.5-7B using axolotl (https://github.com/axolotl-ai-cloud/axolotl). Hardware used for different model sizes:
- 7B: 32x NVIDIA Hopper H100 64GB of the Marenostrum 5.
- 8B: 32x NVIDIA Hopper H100 64GB of the Marenostrum 5.
- 70B: 64x NVIDIA Hopper H100 64GB of the Marenostrum 5.
- 72B: 92x NVIDIA Hopper H100 64GB of the Marenostrum 5.
Training Data
The training set consists of around 1.8B tokens, having 3 different types of data:
- Medical domain datasets: Includes data from 20 different medical tasks, such as [HPAI-BSC/Aloe-Beta-General-Collection](https://huggingface.co/datasets/HPAI-BSC/Aloe-Beta-General-Collection), [HPAI-BSC/chain-of-diagnosis](https://huggingface.co/datasets/HPAI-BSC/chain-of-diagnosis), etc.
- Synthetic data: Generated high-quality answers using Llama3.1-70B, including [HPAI-BSC/pubmedqa-cot-llama31](https://huggingface.co/datasets/HPAI-BSC/pubmedqa-cot-llama31), etc.
- General data: It includes maths, STEM, code, function calling, and instructions with a very long context, like [HPAI-BSC/Aloe-Beta-General-Collection](https://huggingface.co/datasets/HPAI-BSC/Aloe-Beta-General-Collection).
Training parameters
- Epochs: 3
- Sequence length: 16384
- Optimizer: adamw_torch
- Learning rate: 1e-5
- Learning rate scheduler: cosine
- Warmup steps: 100
- Weight decay: 0
- Gradient checkpointing
- ZeRO stage 3
- Total batch size: 128
- Batch size per device: 1
- Gradient accumulation steps: 4
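The total batch size follows from the other settings: devices × per-device batch size × gradient accumulation steps. The sketch below checks this arithmetic, assuming the 32-GPU setup listed for the 7B model applies to these parameters; it also checks the alignment settings listed further down (16 GPUs, 8 accumulation steps). `effective_batch_size` is a hypothetical helper, not part of axolotl.

```python
def effective_batch_size(num_devices, per_device_batch, grad_accum_steps):
    """Number of sequences contributing to one optimizer step."""
    return num_devices * per_device_batch * grad_accum_steps


# SFT: 32 GPUs (assumed 7B setup), batch size 1 per device, 4 accumulation steps.
sft_total = effective_batch_size(32, 1, 4)

# DPO alignment: 16 GPUs, batch size 1 per device, 8 accumulation steps.
dpo_total = effective_batch_size(16, 1, 8)

print(sft_total, dpo_total)
```

Both configurations come out to the stated total batch size of 128.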
Model Merging
The trained model was merged with the Qwen2.5-7B-Instruct model using the DARE_TIES technique. [Mergekit](https://github.com/arcee-ai/mergekit) was used to conduct the merging.
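As a rough illustration of the idea behind DARE_TIES, the sketch below works on plain Python lists rather than real checkpoints. It computes each model's delta from the base weights, randomly drops entries and rescales the survivors (DARE), then keeps only the contributions whose sign agrees with the summed direction (TIES-style sign election). The `density` and `weights` values are hypothetical, since the merge settings are not stated here, and Mergekit's actual implementation differs in detail.

```python
import random


def dare_prune(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def dare_ties_merge(base, finetuned_models, weights, density, seed=0):
    """Toy DARE + TIES-style merge over flat parameter lists (illustrative only)."""
    rng = random.Random(seed)
    deltas = [
        dare_prune([f - b for f, b in zip(model, base)], density, rng)
        for model in finetuned_models
    ]
    merged = []
    for i, b in enumerate(base):
        contributions = [w * d[i] for w, d in zip(weights, deltas)]
        total = sum(contributions)
        # Sign election: keep only contributions that agree with the summed sign.
        agreeing = [c for c in contributions if c * total > 0]
        merged.append(b + sum(agreeing))
    return merged


base = [0.0, 0.0, 0.0]
sft_model = [1.0, -1.0, 2.0]        # stands in for the fine-tuned model
instruct_model = [1.0, 1.0, -1.0]   # stands in for the instruct model
merged = dare_ties_merge(base, [sft_model, instruct_model], [0.5, 0.5], density=1.0)
print(merged)
```

With `density=1.0` nothing is dropped, so the example isolates the sign election: parameters where the two models disagree keep only the dominant direction, or fall back to the base value when the contributions cancel exactly.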
Model Alignment
The model is aligned using the Direct Preference Optimization (DPO) technique through a two-step process:
- General DPO Alignment: Uses a dataset combining medical, general preference, and safety data. We used [HPAI-BSC/Aloe-Beta-DPO](https://huggingface.co/datasets/HPAI-BSC/Aloe-Beta-DPO). Trained iteratively for one epoch on each chunk with a learning rate of 2e-7.
- Red-Teaming Alignment: Further fine-tunes the model to resist attacks. Dataset will be shared soon. Learning rate is set to 1e-7.
We used the OpenRLHF library and aligned the model using 16x NVIDIA Hopper H100 64GB of the Marenostrum 5. Common hyperparameters:
- Sequence length: 4096
- Optimizer: Fused Adam
- Total batch size: 128
- Batch size per device: 1
- Gradient accumulation steps: 8
- Beta: 0.1
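To show where Beta enters, here is a minimal sketch of the standard per-example DPO loss, written with scalar log-probabilities instead of model outputs. The function and argument names are hypothetical, and this is not OpenRLHF's implementation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: zero margin, loss equals log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen answer: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(neutral, improved)
```

A small Beta such as 0.1 scales the margin down, so the policy needs a larger shift away from the reference model before the loss saturates.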
Evaluation
Testing Data, Factors & Metrics
Testing Data
- [ACI-BENCH](https://github.com/wyim/aci-bench)
- [MTS-Dialog](https://github.com/abachaa/MTS-Dialog)
- MedText
- [Medical Text classification](https://www.kaggle.com/datasets/chaitanyakck/medical-text/data)
- [OLAPH](https://github.com/dmis-lab/OLAPH)
- CareQA Open
- MedDialog
- MEDIQA QA
- Meddialog Qsumm
- Biored
- [MIMIC-III](https://huggingface.co/datasets/dmacres/mimiciii-hospitalcourse-meta)
- [Medical Prescription](https://huggingface.co/datasets/devlocalhost/prescription-full)
- MedQA (USMLE)
- MedMCQA
- PubMedQA
- MMLU-Medical
- [MedQA-4-Option](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)
- [CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA)
- [Open LLM Leaderboard 2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Metrics
- Accuracy: suited to the evaluation of multiple-choice question-answering tasks.
- Rouge1: refers to the overlap of unigrams between the system output and the gold standard.
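The two metrics can be sketched in a few lines. This is a simplified illustration, assuming lowercased whitespace tokenization and no stemming; the evaluation harness actually used for Aloe may tokenize and aggregate differently.

```python
from collections import Counter


def accuracy(predictions, references):
    """Fraction of multiple-choice answers that match the gold label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def rouge1(candidate, reference):
    """Unigram overlap between a system output and a gold-standard text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


acc = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
scores = rouge1("the patient has fever", "the patient has a high fever")
print(acc, scores)
```

In this example three of four answers match, and all four candidate words appear in the six-word reference, so precision is 1.0 and recall is 4/6.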
Summary
Benchmark results indicate that the training of Aloe has boosted its performance above all other open models within the same model size. With the help of prompting techniques, the performance of Qwen2.5-Aloe-Beta-7B is significantly improved.
License
This model is based on Qwen2.5-7B, which is released under the Apache 2.0 license. All modifications are available under a CC BY 4.0 license, making the Aloe Beta models compatible with commercial use.