TxGemma model card
TxGemma is a collection of lightweight, state-of-the-art, open language models fine-tuned for therapeutic development. It can process and understand therapeutic information, and is useful for various drug discovery tasks.
🚀 Quick Start
Model documentation: TxGemma
Resources:
- Model on Google Cloud Model Garden: TxGemma
- Model on Hugging Face: TxGemma
- GitHub repository (supporting code, Colab notebooks, discussions, and issues): TxGemma
- Quick start notebook: notebooks/quick_start
- Support: See Contact.
Terms of use: Health AI Developer Foundations terms of use
Author: Google
⨠Features
Description
TxGemma is built upon Gemma 2 and comes in 3 sizes (2B, 9B, and 27B). It's designed to handle information related to various therapeutic modalities and targets. It can perform tasks like property prediction and can be used as a foundation for further fine-tuning or as a conversational agent for drug discovery.
Key Features:
- Versatility: Performs well across a wide range of therapeutic tasks, outperforming or matching best-in-class on many benchmarks.
- Data Efficiency: Shows competitive performance with limited data compared to larger models.
- Conversational Capability (TxGemma-Chat): Can engage in natural language dialogue and explain prediction reasoning.
- Foundation for Fine-tuning: Can be used as a pre-trained base for specialized use cases.
Potential Applications:
- Accelerated Drug Discovery: Streamline the therapeutic development process by predicting properties for various tasks such as target identification, drug-target interaction prediction, and clinical trial approval prediction.
📦 Installation
To access TxGemma on Hugging Face, you're required to review and agree to the Health AI Developer Foundations terms of use. To do this, please ensure you're logged in to Hugging Face and click below. Requests are processed immediately.
⚠️ Important Note
You need to acknowledge the license to access TxGemma on Hugging Face.
💡 Usage Tip
If you want to use the model to run inference on a large number of inputs, create a production version using Model Garden.
💻 Usage Examples
Basic Usage
Below are some example code snippets to help you quickly get started running the model locally on GPU.
Formatting prompts for therapeutic tasks
import json
from huggingface_hub import hf_hub_download

# Load prompt template for tasks from TDC
tdc_prompts_filepath = hf_hub_download(
    repo_id="google/txgemma-27b-predict",
    filename="tdc_prompts.json",
)
with open(tdc_prompts_filepath, "r") as f:
    tdc_prompts_json = json.load(f)
# Set example TDC task and input
task_name = "BBB_Martins"
input_type = "{Drug SMILES}"
drug_smiles = "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"
# Construct prompt using template and input drug SMILES string
TDC_PROMPT = tdc_prompts_json[task_name].replace(input_type, drug_smiles)
print(TDC_PROMPT)
The resulting prompt is in the format expected by the model:
Instructions: Answer the following question about drug properties.
Context: As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier (BBB) is the protection layer that blocks most foreign drugs. Thus the ability of a drug to penetrate the barrier to deliver to the site of action forms a crucial challenge in development of drugs for central nervous system.
Question: Given a drug SMILES string, predict whether it
(A) does not cross the BBB (B) crosses the BBB
Drug SMILES: CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21
Answer:
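The same template file covers the other TDC tasks as well. Assuming tdc_prompts_json is a plain dictionary mapping task names to prompt templates (as the lookup above implies), you can inspect which tasks are available:

# List a few of the TDC task names available in the template file
print(len(tdc_prompts_json), "tasks available")
print(sorted(tdc_prompts_json.keys())[:5])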
Running the model on predictive tasks
# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-predict")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-predict",
    device_map="auto",
)
# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT
# Prepare tokenized inputs
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
# Generate response
outputs = model.generate(**input_ids, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
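If you have more than a handful of inputs but are not yet ready for a Model Garden deployment (see the usage tip above), prompts can also be batched locally. This is a minimal sketch rather than an official example: it reuses the tokenizer and model loaded above, assumes a Python list of TDC-formatted prompts, and switches to left padding so that generation continues from the end of each prompt.

# Batch several formatted TDC prompts through a single generate call
tokenizer.padding_side = "left"  # left-pad so every sequence ends at its prompt

prompts = [TDC_PROMPT]  # extend this list with additional formatted prompts
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=8)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))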
Advanced Usage
You can use the pipeline API, which provides a simple way to run inference while abstracting away complex details of loading and using the model and tokenizer:
# pip install transformers
from transformers import pipeline
# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-predict",
    device="cuda",
)
# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT
# Generate response
outputs = pipe(prompt, max_new_tokens=8)
response = outputs[0]["generated_text"]
print(response)
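By default, the text-generation pipeline returns the prompt followed by the completion. If you only want the model's answer, you can ask the pipeline for the newly generated text alone using its standard return_full_text argument; a minimal sketch:

# Return only the newly generated text, without echoing the prompt
outputs = pipe(prompt, max_new_tokens=8, return_full_text=False)
print(outputs[0]["generated_text"])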
Examples
- To give the model a quick try, running it locally with weights from Hugging Face, see the Quick start notebook in Colab, which includes some example evaluation tasks from TDC.
- For a demo of how to fine-tune TxGemma in Hugging Face, see our Fine-tuning notebook in Colab.
- For a demo of how TxGemma can be used as a tool in a larger agentic workflow powered by Gemini 2, see the Agentic workflow notebook in Colab.
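Beyond the notebooks above, the conversational TxGemma-Chat variants (9B and 27B) can be run directly with the standard transformers chat-template API. The sketch below is illustrative rather than official: it assumes the google/txgemma-9b-chat Hugging Face repository id and reuses the TDC_PROMPT constructed earlier, first asking for a prediction and then for an explanation in a follow-up turn.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/txgemma-9b-chat")
model = AutoModelForCausalLM.from_pretrained("google/txgemma-9b-chat", device_map="auto")

# First turn: ask the TDC-formatted question
messages = [{"role": "user", "content": TDC_PROMPT}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=8)
answer = tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True)

# Second turn: ask the model to explain its prediction
messages += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Explain your reasoning based on the molecule structure."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))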
📚 Documentation
Model architecture overview
- TxGemma is based on the Gemma 2 family of lightweight, state-of-the-art open LLMs. It uses a decoder-only transformer architecture.
- Base Model: Gemma 2 (2B, 9B, and 27B parameter versions).
- Fine-tuning Data: Therapeutics Data Commons, a collection of instruction-tuning datasets covering diverse therapeutic modalities and targets.
- Training Approach: Instruction fine-tuning using a mixture of therapeutic data (TxT) and, for conversational variants, general instruction-tuning data.
- Conversational Variants: TxGemma-Chat models (9B and 27B) are trained with a mixture of therapeutic and general instruction-tuning data to maintain conversational abilities.
Technical Specifications
Property | Details |
---|---|
Model Type | Decoder-only Transformer (based on Gemma 2) |
Key Publication | TxGemma: Efficient and Agentic LLMs for Therapeutics |
Model Created | 2025-03-18 (from the TxGemma Variant Proposal) |
Model Version | 1.0.0 |
Performance & Validation
TxGemma's performance has been validated on a comprehensive benchmark of 66 therapeutic tasks derived from TDC.
Key performance metrics
- Aggregated Improvement: Improves over the original Tx-LLM paper on 45 out of 66 therapeutic tasks.
- Best-in-Class Performance: Surpasses or matches best-in-class performance on 50 out of 66 tasks, exceeding specialist models on 26 tasks. See Table A.11 of the TxGemma paper for the full breakdown.
Inputs and outputs
- Input: Text. For best performance, text prompts should be formatted according to the TDC structure, including instructions, context, question, and, optionally, few-shot examples. Inputs can include SMILES strings, amino acid sequences, nucleotide sequences, and natural language text.
- Output: Text.
Dataset details
Training dataset
- Therapeutics Data Commons: A curated collection of instruction-tuning datasets covering 66 tasks spanning the discovery and development of safe and effective medicine. This includes over 15 million data points across different biomedical entities. Released TxGemma models are only trained on datasets with commercial licenses, whereas models in our publication are also trained on datasets with non-commercial licenses.
- General Instruction-Tuning Data: Used for TxGemma-Chat in combination with TDC.
Evaluation dataset
Therapeutics Data Commons: The same 66 tasks used for training are used for evaluation, following TDC's recommended methodologies for data splits (random, scaffold, cold-start, combination, and temporal).
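For task-specific validation, the same splits can be reproduced locally with the PyTDC package. A minimal sketch, assuming pip install PyTDC and the BBB_Martins task used in the examples above:

from tdc.single_pred import ADME

# Load the BBB_Martins task and retrieve the recommended scaffold split
data = ADME(name="BBB_Martins")
split = data.get_split(method="scaffold")

train_df, valid_df, test_df = split["train"], split["valid"], split["test"]
print(len(train_df), len(valid_df), len(test_df))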
Implementation information
Software
Training was done using JAX. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.
Use and limitations
Intended use
- Research and development of therapeutics.
Benefits
TxGemma provides a versatile and powerful tool for accelerating therapeutic development. It offers:
- Strong performance across a wide range of tasks.
- Data efficiency compared to larger models.
- A foundation for further fine-tuning with private data.
- Integration into agentic workflows.
Limitations
- Trained on public data from TDC.
- Task - specific validation remains an important aspect of downstream model development by the end user.
- As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, scanner, etc.).
🔧 Technical Details
TxGemma's performance has been validated on a comprehensive benchmark of 66 therapeutic tasks derived from TDC. Key performance metrics show significant improvements over previous models on many tasks. The model is based on a decoder-only transformer architecture from the Gemma 2 family and is fine-tuned using Therapeutics Data Commons and general instruction-tuning data.
📄 License
The use of TxGemma is governed by the Health AI Developer Foundations terms of use.
Citation
@article{wang2025txgemma,
  title={TxGemma: Efficient and Agentic LLMs for Therapeutics},
  author={Wang, Eric and Schmidgall, Samuel and Jaeger, Paul F. and Zhang, Fan and Pilgrim, Rory and Matias, Yossi and Barral, Joelle and Fleet, David and Azizi, Shekoofeh},
  year={2025},
}
Find the paper here.

