# TxGemma Model Card
TxGemma is a collection of lightweight, state-of-the-art open language models fine-tuned for therapeutic development. It offers high versatility, data efficiency, and conversational capabilities, serving as a powerful tool for drug discovery and related research.
## Quick Start
To get started with TxGemma, refer to the following code snippets to run the model locally on a GPU. If you plan to run inference on a large number of inputs, it is recommended to create a production version using Model Garden.
## Usage Examples

### Basic Usage
```python
import json
from huggingface_hub import hf_hub_download

# Download the prompt templates for the Therapeutics Data Commons (TDC) tasks
tdc_prompts_filepath = hf_hub_download(
    repo_id="google/txgemma-27b-chat",
    filename="tdc_prompts.json",
)
with open(tdc_prompts_filepath, "r") as f:
    tdc_prompts_json = json.load(f)

# Format the prompt for a specific task, e.g. blood-brain barrier penetration
task_name = "BBB_Martins"
input_type = "{Drug SMILES}"
drug_smiles = "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"
TDC_PROMPT = tdc_prompts_json[task_name].replace(input_type, drug_smiles)
print(TDC_PROMPT)
```
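Since the snippet above indexes `tdc_prompts_json` by task name, the file appears to be a mapping from task names to prompt templates. Assuming that flat structure, a minimal sketch for browsing the available tasks before picking one:

```python
# Sketch: tdc_prompts_json maps TDC task names to prompt templates,
# so its keys enumerate the supported tasks.
print(f"{len(tdc_prompts_json)} tasks available")
print(sorted(tdc_prompts_json)[:5])  # preview a few task names
```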
### Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model (device_map="auto" places weights on available GPUs)
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-chat",
    device_map="auto",
)

# Use the TDC prompt formatted in the Basic Usage example
prompt = TDC_PROMPT
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
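The 27B checkpoint requires substantial GPU memory in half precision. As one option not covered by the snippet above, here is a hedged sketch of loading it with 4-bit quantization via bitsandbytes (assuming `bitsandbytes` is installed; this is an illustrative alternative, not the model card's prescribed setup):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: 4-bit quantized loading to reduce GPU memory footprint.
# Requires `pip install bitsandbytes`.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-chat",
    device_map="auto",
    quantization_config=quantization_config,
)
```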
## Features
- Versatility: Demonstrates strong performance across a wide range of therapeutic tasks, surpassing or matching best-in-class performance on many benchmarks.
- Data Efficiency: Shows competitive performance with limited data compared to larger models, and offers improvements over its predecessors.
- Conversational Capability (TxGemma-Chat): Includes conversational variants that can engage in natural language dialogue and explain the reasoning behind their predictions.
- Foundation for Fine-tuning: Can be used as a pre-trained foundation for specialized use cases.
## Installation
The code snippets provided assume you have installed the necessary libraries. You can install them using the following commands:
```bash
pip install accelerate transformers
```
## Documentation

### Model Information

TxGemma is a collection of lightweight, state-of-the-art, open language models built upon Gemma 2, fine-tuned for therapeutic development. It comes in three sizes: 2B, 9B, and 27B.
Potential Applications:
- Accelerated Drug Discovery: Streamline the therapeutic development process by predicting properties of therapeutics and targets for various tasks, such as target identification, drug-target interaction prediction, and clinical trial approval prediction.
### How to Use

- Formatting prompts for therapeutic tasks: Refer to the code example above for formatting prompts according to the Therapeutics Data Commons (TDC) structure.
- Running the model on predictive tasks: You can use the `AutoTokenizer` and `AutoModelForCausalLM` classes from the `transformers` library, or the `pipeline` API, to run the model; a pipeline sketch follows this list.
- Applying the chat template for conversational use: Use the tokenizer's built-in chat template to format prompts for conversational use; see the sketch under Examples below.
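As a sketch of the `pipeline` route mentioned above, reusing the TDC prompt from the Quick Start (the generation parameters are illustrative, not prescribed by this card):

```python
from transformers import pipeline

# Sketch: run the same TDC prompt through the high-level pipeline API.
pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-chat",
    device_map="auto",
)
outputs = pipe(TDC_PROMPT, max_new_tokens=8)
print(outputs[0]["generated_text"])
```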
### Examples
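A minimal sketch of conversational use with the tokenizer's built-in chat template, assuming the `tokenizer` and `model` loaded in the Advanced Usage example (the message content and token budget are illustrative):

```python
# Sketch: format a single-turn conversation with the chat template.
messages = [
    {"role": "user", "content": TDC_PROMPT},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```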
### Model architecture overview

- Base Model: Gemma 2 (2B, 9B, and 27B parameter versions).
- Fine-tuning Data: Therapeutics Data Commons.
- Training Approach: Instruction fine-tuning using a mixture of therapeutic data (TxT) and, for conversational variants, general instruction-tuning data.
### Technical Specifications

### Performance & Validation

TxGemma's performance has been validated on a comprehensive benchmark of 66 therapeutic tasks derived from TDC.

#### Key performance metrics

- Aggregated Improvement: Improves over the original Tx-LLM paper on 45 of 66 therapeutic tasks.
- Best-in-Class Performance: Surpasses or matches best-in-class performance on 50 of 66 tasks, exceeding specialist models on 26 tasks.
### Inputs and outputs
- Input: Text. For best performance, text prompts should be formatted according to the TDC structure.
- Output: Text.
### Dataset details

- Training dataset: Therapeutics Data Commons and general instruction-tuning data (for TxGemma-Chat).
- Evaluation dataset: Therapeutics Data Commons, using the same 66 tasks for evaluation.
### Implementation information
Training was done using JAX, which allows for faster and more efficient training of large models on the latest generation of hardware, including TPUs.
### Use and limitations
- Intended use: Research and development of therapeutics.
- Benefits: Strong performance, data efficiency, fine - tuning foundation, and integration into agentic workflows.
- Limitations: Trained on public data from TDC; task-specific validation is required, and downstream applications need to be validated with appropriate data.
## Technical Details

TxGemma is based on the Gemma 2 family of lightweight, state-of-the-art open LLMs. It utilizes a decoder-only transformer architecture. The fine-tuning data comes from the Therapeutics Data Commons, which covers diverse therapeutic modalities and targets. The training approach involves instruction fine-tuning using a mixture of therapeutic data and, for conversational variants, general instruction-tuning data.
## License
The use of TxGemma is governed by the Health AI Developer Foundations terms of use.
## Citation

```bibtex
@article{wang2025txgemma,
  title={TxGemma: Efficient and Agentic LLMs for Therapeutics},
  author={Wang, Eric and Schmidgall, Samuel and Jaeger, Paul F. and Zhang, Fan and Pilgrim, Rory and Matias, Yossi and Barral, Joelle and Fleet, David and Azizi, Shekoofeh},
  year={2025},
}
```
Find the paper here.