TxGemma model card
TxGemma is a collection of lightweight, state-of-the-art language models fine-tuned for therapeutic development. It offers strong performance across various therapeutic tasks and can be a valuable tool in drug discovery and development.
🚀 Quick Start
To access TxGemma on Hugging Face, you're required to review and agree to Health AI Developer Foundations' terms of use. To do this, please ensure you're logged in to Hugging Face and click below. Requests are processed immediately.
- Model documentation: TxGemma
- Resources:
- Model on Google Cloud Model Garden: TxGemma
- Model on Hugging Face: TxGemma
- GitHub repository (supporting code, Colab notebooks, discussions, and issues): TxGemma
- Quick start notebook: notebooks/quick_start
- Support: See Contact.
- Terms of use: Health AI Developer Foundations terms of use
- Author: Google
✨ Features
Description
TxGemma is built upon Gemma 2 and comes in three sizes (2B, 9B, and 27B). It can process and understand information related to various therapeutic modalities and targets, and it can be used for property prediction, as a foundation for further fine-tuning, or as an interactive, conversational agent for drug discovery.
Key Features:
- Versatility: Performs well across a wide range of therapeutic tasks, outperforming or matching best-in-class performance on many benchmarks.
- Data Efficiency: Shows competitive performance with limited data compared to larger models.
- Conversational Capability (TxGemma-Chat): Can engage in natural language dialogue and explain prediction reasoning.
- Foundation for Fine-tuning: Can be used as a pre-trained foundation for specialized use cases.
Potential Applications:
- Accelerated Drug Discovery: Streamlines the therapeutic development process by predicting properties of therapeutics and targets for multiple tasks.
📦 Installation
TxGemma is used through the Hugging Face `transformers` library. Install the dependencies required by the examples below with `pip install accelerate transformers`.
💻 Usage Examples
Basic Usage
Formatting prompts for therapeutic tasks
```python
import json
from huggingface_hub import hf_hub_download

# Load prompt template for tasks from TDC
tdc_prompts_filepath = hf_hub_download(
    repo_id="google/txgemma-2b-predict",
    filename="tdc_prompts.json",
)
with open(tdc_prompts_filepath, "r") as f:
    tdc_prompts_json = json.load(f)

# Set example TDC task and input
task_name = "BBB_Martins"
input_type = "{Drug SMILES}"
drug_smiles = "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"

# Construct prompt using template and input drug SMILES string
TDC_PROMPT = tdc_prompts_json[task_name].replace(input_type, drug_smiles)
print(TDC_PROMPT)
```
Running the model on predictive tasks
```python
# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-2b-predict")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-2b-predict",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Prepare tokenized inputs
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
outputs = model.generate(**input_ids, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Advanced Usage
```python
# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-2b-predict",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Generate response
outputs = pipe(prompt, max_new_tokens=8)
response = outputs[0]["generated_text"]
print(response)
```
Examples
- To quickly try the model locally with weights from Hugging Face, see the Quick start notebook in Colab.
- For a demo of fine-tuning TxGemma in Hugging Face, see the Fine-tuning notebook in Colab (a minimal sketch is also shown after this list).
- For a demo of using TxGemma as part of a larger agentic workflow powered by Gemini 2, see the Agentic workflow notebook in Colab.
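The fine-tuning notebook is the authoritative reference; as a rough orientation, the following is a minimal sketch of parameter-efficient fine-tuning with LoRA via the `peft` library. The target modules, hyperparameters, and one-example dataset are illustrative assumptions rather than the settings used in the official notebook, and `TDC_PROMPT` is reused from the usage examples above.
```python
# pip install accelerate transformers peft datasets
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "google/txgemma-2b-predict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative training data: a TDC-style prompt paired with a placeholder answer.
# A real run would iterate over a full TDC task split instead of one example.
train_texts = [{"text": TDC_PROMPT + "(B)"}]
dataset = Dataset.from_list(train_texts).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# Attach LoRA adapters to the attention projections (illustrative choice).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="txgemma-2b-lora",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```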
📚 Documentation
Model architecture overview
- Based on the Gemma 2 family of lightweight LLMs, using a decoder-only transformer architecture.
- Base Model: Gemma 2 (2B, 9B, and 27B parameter versions).
- Fine-tuning Data: Therapeutics Data Commons.
- Training Approach: Instruction fine-tuning using therapeutic data and, for conversational variants, general instruction-tuning data.
- Conversational Variants: TxGemma-Chat models (9B and 27B) are trained with a mixture of data to maintain conversational abilities (a minimal chat example is sketched after this list).
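For illustration, the conversational variants can be queried through the standard `transformers` chat template. The snippet below is a minimal sketch assuming the `google/txgemma-9b-chat` checkpoint and reusing `TDC_PROMPT` from the usage examples above; the follow-up instruction is illustrative.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Conversational variant (the 27B chat checkpoint can be substituted).
model_id = "google/txgemma-9b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask for a prediction plus a natural-language explanation of the reasoning.
messages = [
    {"role": "user", "content": TDC_PROMPT + "\n\nExplain your reasoning."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```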
Technical Specifications
Property | Details
---|---
Model Type | Decoder-only Transformer (based on Gemma 2)
Key publication | TxGemma: Efficient and Agentic LLMs for Therapeutics
Model created | 2025-03-18 (from the TxGemma Variant Proposal)
Model Version | 1.0.0
Performance & Validation
TxGemma's performance has been validated on a benchmark of 66 therapeutic tasks from TDC.
Key performance metrics
- Aggregated Improvement: Improves over the original Tx-LLM paper on 45 out of 66 therapeutic tasks.
- Best-in-Class Performance: Surpasses or matches best-in-class performance on 50 out of 66 tasks, exceeding specialist models on 26 tasks. See Table A.11 of the TxGemma paper for details.
Inputs and outputs
- Input: Text. For best results, format text prompts according to the TDC structure. Inputs can include SMILES strings, amino acid sequences, nucleotide sequences, and natural language text.
- Output: Text.
Dataset details
Training dataset
- Therapeutics Data Commons: A curated collection of instruction-tuning datasets covering 66 tasks in drug discovery and development, with over 15 million data points. Released models are trained on commercially licensed datasets, while models in the publication are also trained on non-commercially licensed datasets.
- General Instruction-Tuning Data: Used for TxGemma-Chat in combination with TDC.
Evaluation dataset
Therapeutics Data Commons: The same 66 tasks used for training are used for evaluation, following TDC's recommended data split methodologies (a minimal example of loading one task's split is sketched below).
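For reference, the split for a single task can be retrieved with the PyTDC package (`pip install PyTDC`). The snippet below is a minimal sketch; the scaffold method, seed, and fractions shown are illustrative choices rather than the exact settings used for TxGemma.
```python
# pip install PyTDC
from tdc.single_pred import ADME

# Load the BBB_Martins task used in the usage examples above.
data = ADME(name="BBB_Martins")

# Retrieve train/valid/test splits (scaffold split; seed and fractions illustrative).
split = data.get_split(method="scaffold", seed=42, frac=[0.7, 0.1, 0.2])
print(split["train"][["Drug", "Y"]].head())
```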
Implementation information
Software
Training was done using JAX. JAX allows for faster and more efficient training of large models on the latest hardware, including TPUs.
Use and limitations
Intended use
- Research and development of therapeutics.
Benefits
- Strong performance across tasks.
- Data efficiency compared to larger models.
- A foundation for further fine-tuning on private data.
- Integration into agentic workflows (a minimal tool-wrapper sketch follows this list).
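As a rough illustration of the agentic-workflow idea, the sketch below wraps a single TxGemma prediction as a plain Python function that an orchestrating agent (for example, a Gemini-powered workflow) could call as a tool. It reuses the `tokenizer`, `model`, and `tdc_prompts_json` objects loaded in the usage examples above; the function name and return handling are hypothetical.
```python
def predict_bbb_permeability(drug_smiles: str) -> str:
    """Hypothetical tool: predict blood-brain barrier permeability for a SMILES string."""
    # Fill the BBB_Martins template from tdc_prompts_json (loaded earlier).
    prompt = tdc_prompts_json["BBB_Martins"].replace("{Drug SMILES}", drug_smiles)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=8)
    # Return only the newly generated tokens (the model's answer).
    answer_ids = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(answer_ids, skip_special_tokens=True).strip()

print(predict_bbb_permeability("CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"))
```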
Limitations
- Trained on public data from TDC.
- Task - specific validation is important for downstream model development.
- Developers should validate downstream applications using representative data.
🔧 Technical Details
TxGemma's performance has been thoroughly evaluated on a comprehensive benchmark of 66 therapeutic tasks derived from the Therapeutics Data Commons (TDC). It shows significant improvements over previous models on many tasks and matches or surpasses best-in-class performance on a large number of them.
📄 License
The use of TxGemma is governed by the Health AI Developer Foundations terms of use.
📚 Citation
```bibtex
@article{wang2025txgemma,
  title={TxGemma: Efficient and Agentic LLMs for Therapeutics},
  author={Wang, Eric and Schmidgall, Samuel and Jaeger, Paul F. and Zhang, Fan and Pilgrim, Rory and Matias, Yossi and Barral, Joelle and Fleet, David and Azizi, Shekoofeh},
  year={2025},
}
```
Find the paper here.

