TxGemma-27b-chat: a free, open language model for therapeutic development, available in multiple sizes

Txgemma 27b Chat

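Developed by Google

TxGemma is a lightweight, open language model built on Gemma 2 and fine-tuned for therapeutic development. It comes in three sizes: 2B, 9B, and 27B.

Large language model

Transformers

English · License: Other · #Drug property prediction #Therapeutic development dialogue #Multimodal therapeutic understanding

Downloads: 1,221

Released: 3/21/2025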

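Model overview

TxGemma is a family of language models optimized for therapeutic development. The models process and understand information about a variety of therapeutic modalities and targets, including small molecules, proteins, nucleic acids, diseases, and cell lines.

Model features

Versatility

Performs well across a wide range of therapeutic tasks, exceeding or matching best-in-class performance on a large number of benchmarks.

Data efficiency

Shows competitive performance even with limited data, improving on its predecessors.

Conversational capability

Includes conversational variants (TxGemma-Chat) that can hold natural-language dialogue and explain the reasoning behind their predictions.

Foundation for fine-tuning

Can serve as a pre-trained foundation for further fine-tuning on specialized use cases.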

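Model capabilities

Therapeutic property prediction

Drug-target interaction prediction

Clinical trial approval prediction

Natural-language dialogue

Explanation of prediction reasoning

Use cases

Drug discovery

Blood-brain barrier penetration prediction

Predicts whether a drug can cross the blood-brain barrier, given its SMILES string.

Performs well on the BBB_Martins task

Target identification

Helps researchers identify potential therapeutic targets.

Therapeutic development

Drug property prediction

Predicts properties of a variety of therapeutics and targets.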

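🚀 TxGemma model card

TxGemma is a collection of lightweight, state-of-the-art, open language models built on Gemma 2 and fine-tuned for therapeutic development. It processes and understands information about a variety of therapeutic modalities and targets, and can be applied in areas such as drug discovery.

🚀 Quick start

Model documentation

TxGemma

Resources

Model on Google Cloud Model Garden: TxGemma
Model on Hugging Face: TxGemma
GitHub repository (supporting code, Colab notebooks, discussions, and issues): TxGemma
Quick start notebook: notebooks/quick_start
Support: see Contact

Terms of use

Health AI Developer Foundations terms of use

Author

Google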

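✨ Key features

Description

TxGemma is a collection of lightweight, state-of-the-art, open language models built on Gemma 2 and fine-tuned for therapeutic development. It comes in three sizes: 2B, 9B, and 27B.

TxGemma models are designed to process and understand information about a variety of therapeutic modalities and targets, including small molecules, proteins, nucleic acids, diseases, and cell lines. TxGemma excels at tasks such as property prediction, and can serve as a foundation for further fine-tuning or as an interactive, conversational agent for drug discovery. The model is fine-tuned from Gemma 2 using a diverse set of instruction-tuning datasets curated from the Therapeutics Data Commons (TDC).

TxGemma is offered both as a prediction model that expects a narrow form of prompting and, for the 9B and 27B versions, as a more flexible conversational model that can be used in multi-turn interactions, including explaining the rationale behind a prediction. The conversational model trades away some raw prediction performance. See our paper for more information.

Key features

Versatility: Performs well across a wide range of therapeutic tasks, exceeding or matching best-in-class performance on a large number of benchmarks.
Data efficiency: Shows competitive performance even with limited data compared to larger models, improving on its predecessors.
Conversational capability (TxGemma-Chat): Includes conversational variants that can hold natural-language dialogue and explain the reasoning behind their predictions.
Foundation for fine-tuning: Can be used as a pre-trained foundation for specialized use cases.

Potential applications

TxGemma can be a valuable tool for researchers in the following areas:

Accelerated drug discovery: Streamline the therapeutic development process by predicting properties of therapeutics and targets for a wide variety of tasks, including target identification, drug-target interaction prediction, and clinical trial approval prediction.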

💻 Usage examples

Basic usage

Formatting prompts for therapeutic tasks

import json
from huggingface_hub import hf_hub_download

# Load prompt template for tasks from TDC
tdc_prompts_filepath = hf_hub_download(
    repo_id="google/txgemma-27b-chat",
    filename="tdc_prompts.json",
)
with open(tdc_prompts_filepath, "r") as f:
    tdc_prompts_json = json.load(f)

# Set example TDC task and input
task_name = "BBB_Martins"
input_type = "{Drug SMILES}"
drug_smiles = "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"

# Construct prompt using template and input drug SMILES string
TDC_PROMPT = tdc_prompts_json[task_name].replace(input_type, drug_smiles)
print(TDC_PROMPT)

The resulting prompt is in the format the model expects:

Instructions: Answer the following question about drug properties.
Context: As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier (BBB) is the protection layer that blocks most foreign drugs. Thus the ability of a drug to penetrate the barrier to deliver to the site of action forms a crucial challenge in development of drugs for central nervous system.
Question: Given a drug SMILES string, predict whether it
(A) does not cross the BBB (B) crosses the BBB
Drug SMILES: CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21
Answer:
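Some templates may contain more than one placeholder, in which case a single `str.replace` call is not enough. The sketch below fills several placeholders at once and fails loudly if any is left unfilled. The helper name `fill_prompt` and the idea that every placeholder is a `{...}` token are assumptions based on the `{Drug SMILES}` example above, not a documented TxGemma API.

```python
import re


def fill_prompt(template, values):
    """Fill every `{Placeholder}` token in `template` from the `values` mapping.

    `values` maps the literal placeholder (braces included, as in the
    "{Drug SMILES}" example) to the string that should replace it.
    Hypothetical helper: not part of TxGemma or TDC.
    """
    # Inspect the template before substituting, so braces inside the
    # inserted values can never be mistaken for placeholders.
    missing = [p for p in re.findall(r"\{[^{}]+\}", template) if p not in values]
    if missing:
        raise KeyError(f"no value supplied for placeholders: {missing}")
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


example = fill_prompt(
    "Drug SMILES: {Drug SMILES}\nAnswer:",
    {"{Drug SMILES}": "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"},
)
print(example)
```

With the template loaded earlier, the call would look like `fill_prompt(tdc_prompts_json[task_name], {input_type: drug_smiles})`.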

Running the model on prediction tasks

# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-chat",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Prepare tokenized inputs
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
outputs = model.generate(**input_ids, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

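Alternatively, you can use the pipeline API, which offers a simple way to run inference while abstracting away the details of loading and using the model and tokenizer: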

# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-chat",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Generate response
outputs = pipe(prompt, max_new_tokens=8)
response = outputs[0]["generated_text"]
print(response)
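Both snippets above print the prompt followed by the model's completion, so downstream code usually has to pull the chosen option out of the text. The sketch below takes whatever follows the last `Answer:` marker and returns the first standalone option letter. It assumes the completion names an option such as `(A)` or `B`; that output format is an assumption, and `extract_choice` is a hypothetical helper rather than part of any library.

```python
import re


def extract_choice(generated_text, options=("A", "B"), marker="Answer:"):
    """Return the option letter found after the last `marker`, or None.

    Assumption: the model replies with an option letter such as "(B)".
    Anything else yields None so the caller can handle it explicitly.
    """
    # Keep only the text after the final marker; if the marker is absent,
    # rsplit returns the whole string and we search all of it.
    completion = generated_text.rsplit(marker, 1)[-1]
    match = re.search(r"\b([A-Z])\b", completion)
    if match and match.group(1) in options:
        return match.group(1)
    return None


sample = "Question: ...\n(A) does not cross the BBB (B) crosses the BBB\nAnswer: (B)"
print(extract_choice(sample))
```

Returning `None` instead of guessing keeps unparseable generations visible when scoring a batch of predictions.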

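Applying the chat template for conversational use

TxGemma-Chat models use a chat template that must be followed for conversational use. The easiest way to apply it is with the tokenizer's built-in chat template, as shown below: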

# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-chat",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Format prompt in the conversational format
messages = [
    { "role": "user", "content": prompt}
]

# Apply the tokenizer's built-in chat template
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

At this point, the prompt contains the following text:

<bos><start_of_turn>user
Instructions: Answer the following question about drug properties.
Context: As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier (BBB) is the protection layer that blocks most foreign drugs. Thus the ability of a drug to penetrate the barrier to deliver to the site of action forms a crucial challenge in development of drugs for central nervous system.
Question: Given a drug SMILES string, predict whether it
(A) does not cross the BBB (B) crosses the BBB
Drug SMILES: CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21
Answer:<end_of_turn>
<start_of_turn>model

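As you can see, each turn begins with the <start_of_turn> delimiter followed by the role of the entity (user for content supplied by the user, model for LLM responses). Each turn ends with the <end_of_turn> token.

You can follow this format to build the prompt manually if you need to do so without the tokenizer's chat template.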
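A minimal sketch of building that prompt by hand, written to reproduce the turn format shown above. The function name `build_chat_prompt` is hypothetical, and mapping the `assistant` role to `model` is an assumption about how message histories are usually written; compare the result with `tokenizer.apply_chat_template` before relying on it.

```python
def build_chat_prompt(messages, add_generation_prompt=True):
    """Render {"role", "content"} messages in the turn format shown above."""
    parts = ["<bos>"]
    for message in messages:
        # LLM turns are labelled "model" in the rendered prompt, so an
        # "assistant" role in the message history is mapped to it.
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


manual_prompt = build_chat_prompt([{"role": "user", "content": "Hello"}])
print(manual_prompt)
```

Because the string already starts with `<bos>`, it should be tokenized with `add_special_tokens=False` so the token is not added twice.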

Once the prompt is ready, generation can be performed as follows:

inputs = tokenizer.encode(chat_prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to("cuda"), max_new_tokens=8)
response = tokenizer.decode(outputs[0, len(inputs[0]):], skip_special_tokens=True)
print(response)

For multi-turn interactions, append the model response and the next user prompt to the chat message history, then generate in the same way:

messages.extend([
    { "role": "assistant", "content": response },
    { "role": "user", "content": "Explain your reasoning based on the molecule structure." },
])

chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(chat_prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to("cuda"), max_new_tokens=512)
response = tokenizer.decode(outputs[0, len(inputs[0]):], skip_special_tokens=True)
print(response)

Alternatively, you can use the pipeline API, which abstracts away details such as applying the chat template with the tokenizer:

# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-chat",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Format prompt in the conversational format for initial turn
messages = [
    { "role": "user", "content": prompt}
]

# Generate response for initial turn
outputs = pipe(messages, max_new_tokens=8)
print(outputs[0]["generated_text"][-1]["content"].strip())

# Append user prompt for an additional turn
messages = outputs[0]["generated_text"]
messages.append(
    { "role": "user", "content": "Explain your reasoning based on the molecule structure." }
)

# Generate response for additional turn
outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1]["content"].strip())

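Examples

See the following Colab notebooks for examples of how to use TxGemma:

To give the model a quick try, running it locally with weights from Hugging Face, see the quick start notebook in Colab, which includes some example evaluation tasks from TDC.
To learn how to fine-tune TxGemma in Hugging Face, see our fine-tuning notebook in Colab.
To learn how to use TxGemma as part of a larger agentic workflow powered by Gemini 2, see the agentic workflow notebook in Colab.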

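📚 Detailed documentation

Model architecture overview

TxGemma is based on the Gemma 2 family of lightweight, state-of-the-art open LLMs. It uses a decoder-only transformer architecture.
Base model: Gemma 2 (2B, 9B, and 27B parameter versions).
Fine-tuning data: Therapeutics Data Commons, a collection of instruction-tuning datasets covering diverse therapeutic modalities and targets.
Training approach: Instruction fine-tuning on a mixture of therapeutic data (TxT) and, for the conversational variants, general instruction-tuning data.
Conversational variants: TxGemma-Chat models (9B and 27B) are trained on a mixture of therapeutic and general instruction-tuning data to preserve conversational ability.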

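Technical specifications

Property	Details
Model type	Decoder-only transformer (based on Gemma 2)
Key publication	TxGemma: Efficient and Agentic LLMs for Therapeutics
Model created	2025-03-18 (from the TxGemma variant proposal)
Model version	1.0.0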

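Performance and validation

TxGemma's performance has been validated on a comprehensive benchmark of 66 therapeutic tasks derived from TDC.

Key performance metrics

Aggregated improvement: Improves on the original Tx-LLM paper on 45 of the 66 therapeutic tasks.
Best-in-class performance: Exceeds or matches best-in-class performance on 50 of the 66 tasks, and exceeds specialist models on 26 of them. See Table A.11 of the TxGemma paper for the full breakdown.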

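Inputs and outputs

Input: Text. For best performance, text prompts should be formatted according to the TDC structure, including instructions, context, question, and, optionally, few-shot examples. Inputs can include SMILES strings, amino acid sequences, nucleotide sequences, and natural-language text.
Output: Text.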
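The prompt structure described above (instructions, context, question, optional few-shot examples) can be assembled programmatically. The sketch below mirrors the line layout of the BBB_Martins prompt shown earlier and places each few-shot example as an input/answer pair before the final query. That placement, the helper name `build_few_shot_prompt`, and the example labels are all assumptions for illustration; check `tdc_prompts.json` and the paper for the format actually used in training.

```python
def build_few_shot_prompt(instructions, context, question, examples, query,
                          input_label="Drug SMILES"):
    """Assemble an Instructions/Context/Question prompt with few-shot pairs.

    `examples` is a list of (input, answer) tuples. Where the examples sit
    in the prompt is an assumption, not a documented TxGemma format.
    """
    lines = [
        f"Instructions: {instructions}",
        f"Context: {context}",
        f"Question: {question}",
    ]
    for example_input, example_answer in examples:
        lines.append(f"{input_label}: {example_input}")
        lines.append(f"Answer: {example_answer}")
    # The final query is left unanswered so the model completes it.
    lines.append(f"{input_label}: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


few_shot_prompt = build_few_shot_prompt(
    instructions="Answer the following question about drug properties.",
    context="(task context goes here)",
    question="Given a drug SMILES string, predict whether it\n"
             "(A) does not cross the BBB (B) crosses the BBB",
    # Placeholder labels for illustration only, not measured data.
    examples=[("CCO", "(B)"), ("c1ccccc1", "(B)")],
    query="CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21",
)
print(few_shot_prompt)
```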

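🔧 Technical details

Dataset details

Training datasets

Therapeutics Data Commons: A curated collection of instruction-tuning datasets covering 66 tasks across the discovery and development of safe and effective medicines. It includes over 15 million data points across different biomedical entities. The released TxGemma models were trained only on datasets with commercial licenses, whereas the models in our paper were also trained on datasets with non-commercial licenses.
General instruction-tuning data: Used in combination with TDC for TxGemma-Chat.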

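Evaluation datasets

Therapeutics Data Commons: The same 66 tasks used for training are used for evaluation, following TDC's recommended data splits (random, scaffold, cold-start, combination, and temporal).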
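Of the split methods listed above, a scaffold split keeps molecules that share a core structure on the same side of the train/test boundary. The sketch below illustrates that idea with a greedy largest-group-first heuristic. It assumes scaffold keys have already been computed by a cheminformatics toolkit; `grouped_split` and the scaffold labels are hypothetical, and this is not claimed to be TDC's implementation.

```python
from collections import defaultdict


def grouped_split(items, group_of, test_fraction=0.2):
    """Split `items` so that no group appears in both train and test.

    `group_of` maps an item to its group key (for a scaffold split, the
    molecule's precomputed scaffold). Greedy heuristic: the largest groups
    fill the training set first and the remainder goes to the test set.
    """
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    train, test = [], []
    train_target = (1 - test_fraction) * len(items)
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical (molecule id, scaffold key) pairs, for illustration only.
molecules = [("m1", "s1"), ("m2", "s1"), ("m3", "s1"), ("m4", "s2"), ("m5", "s3")]
train_set, test_set = grouped_split(molecules, group_of=lambda m: m[1])
print(len(train_set), len(test_set))
```

Because whole groups are moved together, the realized test fraction only approximates `test_fraction`.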

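Implementation information

Software

Training was done using JAX. JAX lets researchers take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.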

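Use and limitations

Intended use

Research and development of therapeutics.

Benefits

TxGemma provides a versatile and powerful tool for accelerating therapeutic development. It offers the following benefits:

Strong performance across a wide range of tasks.
Data efficiency compared to larger models.
A foundation for further fine-tuning on private data.
Integration into agentic workflows.

Limitations

Trained on public data from TDC.
Task-specific validation remains an important part of downstream model development by the end user.
As with any research, developers should validate every downstream application on data that appropriately represents the intended use setting for that application (e.g., age, sex, condition, scanner) in order to understand its performance.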

📄 License

Use of TxGemma is governed by the Health AI Developer Foundations terms of use.

Citation

@article{wang2025txgemma,
    title={TxGemma: Efficient and Agentic LLMs for Therapeutics},
    author={Wang, Eric and Schmidgall, Samuel and Jaeger, Paul F. and Zhang, Fan and Pilgrim, Rory and Matias, Yossi and Barral, Joelle and Fleet, David and Azizi, Shekoofeh},
    year={2025},
}

The paper can be found here.