TxGemma-27b-chat open-source language model: free support for therapeutic development, available in multiple sizes

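TxGemma 27B Chat

Developed by Google

TxGemma is a lightweight open-source language model built on Gemma 2 and fine-tuned for therapeutic development. It is available in three sizes: 2B, 9B, and 27B.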

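Large language model

Transformers

Language: English | License: Other | #drug-property-prediction #therapeutic-development-dialogue #multimodal-therapeutic-understanding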

Released: 3/21/2025

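Model overview

TxGemma is a family of language models optimized for therapeutic development. The models can process and understand information about a range of therapeutic modalities and targets, including small molecules, proteins, nucleic acids, diseases, and cell lines.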

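Model features

Versatility

Performs strongly across a broad range of therapeutic tasks, matching or exceeding the best reported performance on a large number of benchmarks.

Data efficiency

Remains competitive even when data is limited, improving on its predecessors.

Conversational capability

Includes a conversational variant (TxGemma-Chat) that can hold natural-language conversations and explain the reasoning behind its predictions.

Foundation for fine-tuning

Can serve as a pretrained base for further fine-tuning on specialized use cases.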

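Model capabilities

Therapeutic property prediction

Drug-target interaction prediction

Clinical trial approval prediction

Natural-language conversation

Explanation of prediction reasoning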

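Use cases

Drug discovery

Blood-brain barrier penetration prediction

Predicts from a drug's SMILES string whether the drug can cross the blood-brain barrier.

Performs strongly on the BBB_Martins task

Target identification

Helps researchers identify potential therapeutic targets.

Therapeutic development

Drug property prediction

Predicts properties of a variety of therapeutics and targets.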

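🚀 TxGemma model card

TxGemma is a collection of lightweight, state-of-the-art open language models built on Gemma 2 and fine-tuned for therapeutic development. The models can process and understand information about a range of therapeutic modalities and targets, and can be applied to drug discovery and related fields.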

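🚀 Quick start

Model documentation

TxGemma

Resources

Model on Google Cloud Model Garden: TxGemma
Model on Hugging Face: TxGemma
GitHub repository (supporting code, Colab notebooks, discussions, and issues): TxGemma
Quick start notebook: notebooks/quick_start
Support: see Contact

Terms of use

Health AI Developer Foundations terms of use

Author

Google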

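✨ Main features

Description

TxGemma is a collection of lightweight, state-of-the-art open language models built on Gemma 2 and fine-tuned for therapeutic development. It comes in three sizes: 2B, 9B, and 27B.

TxGemma models are designed to process and understand information about a range of therapeutic modalities and targets, including small molecules, proteins, nucleic acids, diseases, and cell lines. TxGemma performs strongly on tasks such as property prediction, and it can serve as a foundation for further fine-tuning or as an interactive conversational agent for drug discovery. The models are fine-tuned from Gemma 2 on a diverse set of instruction-tuning datasets curated from the Therapeutics Data Commons (TDC).

TxGemma is offered as a predictive model that expects prompts in a specific format and, for the 9B and 27B versions, as a more flexible conversational model that supports multi-turn interaction, including explaining the rationale behind a prediction. The conversational model trades away some raw predictive performance for this flexibility. See our paper for more information.

Key features

Versatility: Performs strongly across a broad range of therapeutic tasks, exceeding or matching best-in-class performance on a large number of benchmarks.
Data efficiency: Shows competitive performance relative to larger models even when data is limited, and improves on its predecessors.
Conversational capability (TxGemma-Chat): Includes conversational variants that can hold natural-language conversations and explain the reasoning behind their predictions.
Foundation for fine-tuning: Can be used as a pretrained base for specialized use cases.

Potential applications

TxGemma can be a valuable tool for researchers in the following area:

Accelerated drug discovery: Streamlines therapeutic development by predicting properties of therapeutics and targets across a range of tasks, including target identification, drug-target interaction prediction, and clinical trial approval prediction.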

💻 Usage examples

Basic usage

Formatting prompts for therapeutic tasks

import json
from huggingface_hub import hf_hub_download

# Load prompt template for tasks from TDC
tdc_prompts_filepath = hf_hub_download(
    repo_id="google/txgemma-27b-chat",
    filename="tdc_prompts.json",
)
with open(tdc_prompts_filepath, "r") as f:
    tdc_prompts_json = json.load(f)

# Set example TDC task and input
task_name = "BBB_Martins"
input_type = "{Drug SMILES}"
drug_smiles = "CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21"

# Construct prompt using template and input drug SMILES string
TDC_PROMPT = tdc_prompts_json[task_name].replace(input_type, drug_smiles)
print(TDC_PROMPT)

The resulting prompt is in the format the model expects:

Instructions: Answer the following question about drug properties.
Context: As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier (BBB) is the protection layer that blocks most foreign drugs. Thus the ability of a drug to penetrate the barrier to deliver to the site of action forms a crucial challenge in development of drugs for central nervous system.
Question: Given a drug SMILES string, predict whether it
(A) does not cross the BBB (B) crosses the BBB
Drug SMILES: CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21
Answer:
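
The placeholder substitution used above can be wrapped in a small helper that fills several placeholders at once and fails loudly when one is left unfilled. This is a minimal sketch, assuming templates use {Name}-style placeholders as in the BBB_Martins example. The fill_prompt helper and the shortened EXAMPLE_TEMPLATE are hypothetical illustrations, not part of the TxGemma repository or of tdc_prompts.json.

```python
import re

# Hypothetical, shortened template for illustration only.
# Real templates should be loaded from tdc_prompts.json.
EXAMPLE_TEMPLATE = (
    "Instructions: Answer the following question about drug properties.\n"
    "Question: Given a drug SMILES string, predict whether it\n"
    "(A) does not cross the BBB (B) crosses the BBB\n"
    "Drug SMILES: {Drug SMILES}\n"
    "Answer:"
)

# Matches placeholders such as {Drug SMILES}.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def fill_prompt(template: str, values: dict) -> str:
    """Replace every {Name} placeholder; raise KeyError if a value is missing."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {{{name}}}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)
```

Raising on a missing value keeps a half-filled template from reaching the model, where it would otherwise produce a prediction for the literal placeholder text.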

Running the model on prediction tasks

# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-chat",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Prepare tokenized inputs on the device that holds the model's input layer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response (the decoded text contains the prompt followed by the answer)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

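Alternatively, you can use the pipeline API, which offers a simple way to run inference while abstracting away the details of loading and using the model and tokenizer: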

# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-chat",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Generate response
outputs = pipe(prompt, max_new_tokens=8)
response = outputs[0]["generated_text"]
print(response)
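
For a string prompt, the text-generation pipeline returns the prompt followed by the continuation by default, and the prompt itself already contains "(A)" and "(B)". Extracting the predicted option therefore means looking only at the continuation. The sketch below is one way to do that, assuming the model answers with an option letter such as "(B)" or "B" at the start of its continuation. The parse_choice helper is hypothetical and the answer format is an assumption, not something the model card specifies.

```python
import re
from typing import Optional

# An option letter at the start of the continuation, optionally in parentheses,
# and not followed by another letter (so "Because ..." is not read as "B").
CHOICE = re.compile(r"^\s*\(?([A-Z])\)?(?![A-Za-z])")


def parse_choice(generated_text: str, prompt: str = "") -> Optional[str]:
    """Return the option letter from the model's continuation, or None."""
    # Strip the prompt when the generated text starts with it.
    if prompt and generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    match = CHOICE.match(continuation)
    return match.group(1) if match else None
```

With the pipeline example above, this would be called as parse_choice(response, prompt). A None result signals that the continuation did not start with a recognizable option and should be inspected by hand.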

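Applying the chat template for conversational use

TxGemma-Chat models use a chat template that must be followed for conversational use. The easiest way to apply it is with the tokenizer's built-in chat template, as shown below: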

# pip install accelerate transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model directly from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/txgemma-27b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "google/txgemma-27b-chat",
    device_map="auto",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Format prompt in the conversational format
messages = [
    { "role": "user", "content": prompt}
]

# Apply the tokenizer's built-in chat template
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

At this point, the prompt contains the following text:

<bos><start_of_turn>user
Instructions: Answer the following question about drug properties.
Context: As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier (BBB) is the protection layer that blocks most foreign drugs. Thus the ability of a drug to penetrate the barrier to deliver to the site of action forms a crucial challenge in development of drugs for central nervous system.
Question: Given a drug SMILES string, predict whether it
(A) does not cross the BBB (B) crosses the BBB
Drug SMILES: CN1C(=O)CN=C(C2=CCCCC2)c2cc(Cl)ccc21
Answer:<end_of_turn>
<start_of_turn>model

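As you can see, each turn begins with the <start_of_turn> delimiter followed by the role of the entity (user for content supplied by the user, model for responses from the LLM). Each turn ends with the <end_of_turn> token.

You can follow this format to build a prompt manually if you need to do so without the tokenizer's chat template.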
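
A manual construction of this format might look like the sketch below. It reproduces the layout of the prompt text shown above. The build_gemma_chat_prompt helper is hypothetical, and the mapping of the "assistant" role to "model" and the trimming of whitespace are assumptions about how the built-in template behaves, so compare the result against tokenizer.apply_chat_template before relying on it.

```python
def build_gemma_chat_prompt(messages, add_generation_prompt=True):
    """Build a chat prompt string from a list of {"role", "content"} dicts."""
    parts = ["<bos>"]
    for message in messages:
        role = message["role"]
        if role in ("assistant", "model"):
            role = "model"  # assumed: assistant turns are labeled "model"
        elif role != "user":
            raise ValueError(f"unsupported role: {role!r}")
        # Each turn: <start_of_turn>ROLE, newline, content, <end_of_turn>, newline
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the model.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)
```

Because the string already starts with <bos>, it should be tokenized with add_special_tokens=False, as in the generation code that follows, so the token is not added twice.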

Once the prompt is ready, generation can be performed as follows:

inputs = tokenizer.encode(chat_prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=8)
# Decode only the newly generated tokens, skipping the prompt tokens
response = tokenizer.decode(outputs[0, len(inputs[0]):], skip_special_tokens=True)
print(response)

For subsequent turns, append the model response and the next user prompt to the chat message history, then generate in the same way:

messages.extend([
    { "role": "assistant", "content": response },
    { "role": "user", "content": "Explain your reasoning based on the molecule structure." },
])

chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(chat_prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt tokens
response = tokenizer.decode(outputs[0, len(inputs[0]):], skip_special_tokens=True)
print(response)

Alternatively, you can use the pipeline API, which abstracts away details such as applying the chat template with the tokenizer:

# pip install transformers
from transformers import pipeline

# Instantiate a text generation pipeline using the model
pipe = pipeline(
    "text-generation",
    model="google/txgemma-27b-chat",
    device="cuda",
)

# Formatted TDC prompt (see "Formatting prompts for therapeutic tasks" section above)
prompt = TDC_PROMPT

# Format prompt in the conversational format for initial turn
messages = [
    { "role": "user", "content": prompt}
]

# Generate response for initial turn
outputs = pipe(messages, max_new_tokens=8)
print(outputs[0]["generated_text"][-1]["content"].strip())

# Append user prompt for an additional turn
messages = outputs[0]["generated_text"]
messages.append(
    { "role": "user", "content": "Explain your reasoning based on the molecule structure." }
)

# Generate response for additional turn
outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1]["content"].strip())

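Examples

See the following Colab notebooks for examples of how to use TxGemma:

To try the model quickly by running it locally with weights from Hugging Face, see the quick start notebook in Colab, which includes several example evaluation tasks from TDC.
To learn how to fine-tune TxGemma in Hugging Face, see our fine-tuning notebook in Colab.
To learn how to use TxGemma as part of a larger agentic workflow powered by Gemini 2, see the agentic workflow notebook in Colab.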

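📚 Detailed documentation

Model architecture overview

TxGemma is based on the Gemma 2 family of lightweight, state-of-the-art open LLMs. It uses a decoder-only Transformer architecture.
Base model: Gemma 2 (2B, 9B, and 27B parameter versions).
Fine-tuning data: Therapeutics Data Commons, a collection of instruction-tuning datasets covering diverse therapeutic modalities and targets.
Training approach: Instruction fine-tuning on a mixture of therapeutic data (TxT) and, for the conversational variants, general instruction-tuning data.
Conversational variants: TxGemma-Chat models (9B and 27B) are trained on a mixture of therapeutic and general instruction-tuning data to preserve conversational ability.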

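Technical specifications

Property	Details
Model type	Decoder-only Transformer (based on Gemma 2)
Key publication	TxGemma: Efficient and Agentic LLMs for Therapeutics
Model created	2025-03-18 (from the TxGemma Variant Proposal)
Model version	1.0.0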

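Performance and validation

TxGemma's performance has been validated on a comprehensive benchmark of 66 therapeutic tasks derived from TDC.

Key performance metrics

Aggregate improvement: Improves on the original Tx-LLM paper on 45 of the 66 therapeutic tasks.
Best-in-class performance: Exceeds or matches best-in-class performance on 50 of the 66 tasks, and exceeds specialist models on 26 tasks. See Table A.11 of the TxGemma paper for the full breakdown.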

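Inputs and outputs

Input: Text. For best performance, text prompts should be formatted according to the TDC structure, including instructions, context, question, and, optionally, few-shot examples. Inputs can include SMILES strings, amino acid sequences, nucleotide sequences, and natural-language text.
Output: Text.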
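
The input structure described above (instructions, context, question, optional few-shot examples) could be assembled along the following lines. This is only a sketch: the field labels follow the BBB_Martins prompt shown earlier, but the placement and layout of few-shot examples is an assumption, and build_few_shot_prompt is a hypothetical helper. The templates in tdc_prompts.json and the paper remain the authoritative reference for the exact format.

```python
def build_few_shot_prompt(instructions, context, question, examples, query,
                          input_label="Drug SMILES"):
    """Assemble a TDC-style prompt with optional (input, answer) examples.

    `examples` is a list of (input_text, answer_text) pairs and may be empty.
    """
    lines = [
        f"Instructions: {instructions}",
        f"Context: {context}",
        f"Question: {question}",
    ]
    # Assumed layout: each solved example repeats the input/answer pair.
    for example_input, example_answer in examples:
        lines.append(f"{input_label}: {example_input}")
        lines.append(f"Answer: {example_answer}")
    # The final, unanswered input is the one the model should complete.
    lines.append(f"{input_label}: {query}")
    lines.append("Answer:")
    return "\n".join(lines)
```

With an empty examples list the result reduces to the zero-shot layout used in the usage examples.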

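🔧 Technical details

Dataset details

Training datasets

Therapeutics Data Commons: A curated collection of instruction-tuning datasets covering 66 tasks that span the discovery and development of safe and effective medicines. It includes more than 15 million data points across different biomedical entities. The released TxGemma models were trained only on datasets with commercial licenses, whereas the models in our paper were also trained on datasets with non-commercial licenses.
General instruction-tuning data: Used in combination with TDC for TxGemma-Chat.

Evaluation dataset

Therapeutics Data Commons: Evaluation uses the same 66 tasks as training, following the data split methods recommended by TDC (random, scaffold, cold-start, combination, and temporal splits).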
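
Scaffold and cold-start splits share one idea: every item belonging to the same group (a molecular scaffold, a drug, a protein) is placed on one side of the split, so the test set contains only groups never seen in training. The sketch below illustrates that idea in plain Python. It is a generic group-based holdout, not TDC's implementation, and it assumes the group key for each item (for example, a precomputed scaffold string) is already available. The group_split name and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def group_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items into (train, test) so that no group appears in both."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Sort first so the shuffle is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(items)
    train, test = [], []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test
```

Because whole groups are assigned at once, the test set can end up somewhat larger than test_fraction when groups are uneven in size.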

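Implementation information

Software

Training was done using JAX. JAX lets researchers take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.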

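Use and limitations

Intended use

Research and development of therapeutics.

Benefits

TxGemma provides a versatile and powerful tool for accelerating therapeutic development. It offers:

Strong performance across a broad range of tasks.
Better data efficiency than larger models.
A foundation for further fine-tuning on private data.
Integration into agentic workflows.

Limitations

Trained on public data from TDC.
Task-specific validation remains an important part of downstream model development by the end user.
As with any research, developers should ensure that any downstream application is validated, so that its performance is understood on data that appropriately represents the intended use setting for the specific application (e.g., age, sex, condition, scanner).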

📄 License

Use of TxGemma is governed by the Health AI Developer Foundations terms of use.

Citation

@article{wang2025txgemma,
    title={TxGemma: Efficient and Agentic LLMs for Therapeutics},
    author={Wang, Eric and Schmidgall, Samuel and Jaeger, Paul F. and Zhang, Fan and Pilgrim, Rory and Matias, Yossi and Barral, Joelle and Fleet, David and Azizi, Shekoofeh},
    year={2025},
}

The paper can be found here.