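🚀 DataGemma RAG Model Card
DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses, supporting answers to statistical questions.
🚀 Quick Start
Environment setup
Running the model requires a few dependencies. The install command depends on how you run it:
- Single GPU or multiple GPUs:
pip install -U transformers accelerate
- 4-bit quantization (using bitsandbytes):
pip install -U transformers bitsandbytes accelerate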
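As a rough guide to choosing between the two setups, the arithmetic below estimates the memory needed for the weights alone. It is a back-of-the-envelope sketch: the parameter count is an assumption read off the "27b" in the model ID, and the figures ignore activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 27 billion parameters, inferred from the "27b" in the model ID.
NUM_PARAMS = 27e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # bfloat16 stores 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantization stores 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights come to about 54 GB and the 4-bit weights to about 13.5 GB, which is why the quantized path is the more practical one on a single GPU.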
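Code examples
The snippets below run the fine-tuned model. This is only one step of the full RAG approach described in the DataGemma paper; the accompanying Colab notebook lets you try the end-to-end RAG flow.
Running on a single GPU or multiple GPUs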
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
Running with 4-bit quantization (using bitsandbytes)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
Example output
What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
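Because the model returns one question per line, a small post-processing step is usually needed before the questions can be sent anywhere. The sketch below is one possible way to do it, not part of the DataGemma code: `parse_questions` is a hypothetical helper, and the regular expressions are a loose approximation of the three question forms in the prompt. It also caps the list at 25, since the example output above contains 27 lines even though the prompt asks for at most 25.

```python
import re

# Loose patterns for the question forms named in the prompt.
# Assumption: the prompt describes the forms in words; these regexes are our own approximation.
_FORMS = (
    re.compile(r"^What is .+\?$"),
    re.compile(r"^How has .+ changed over time.*\?$"),
)


def parse_questions(answer: str, limit: int = 25) -> list[str]:
    """Split raw model output into unique, well-formed questions, keeping order."""
    questions = []
    seen = set()
    for line in answer.splitlines():
        question = line.strip()
        if not question or question in seen:
            continue  # skip blank lines and exact duplicates
        if any(pattern.match(question) for pattern in _FORMS):
            seen.add(question)
            questions.append(question)
    return questions[:limit]
```

With the snippets above, `parse_questions(answer)` would turn the decoded string into a list that can be iterated over one query at a time.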
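✨ Key Features
- Data integration: DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses.
- Retrieval-augmented generation (RAG): DataGemma RAG is used with retrieval-augmented generation. It is trained to take a user query and generate natural language queries that Data Commons' existing natural language interface can understand.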
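📚 Detailed Documentation
Model information
Description
DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RAG is used with retrieval-augmented generation (RAG): it is trained to take a user query and generate natural language queries that Data Commons' existing natural language interface can understand. For more information, see the DataGemma research paper.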
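To show where the model sits in a larger workflow, here is a minimal sketch of a RAG loop built around it. It assumes the remaining steps are a lookup against Data Commons and a final answer-writing step by another LLM; this card does not spell those steps out, so every function name below (`generate_questions`, `fetch_statistics`, `compose_answer`) is a hypothetical placeholder passed in by the caller rather than a real API.

```python
from typing import Callable, Dict, List


def answer_with_rag(
    user_query: str,
    generate_questions: Callable[[str], List[str]],        # hypothetical: wraps DataGemma RAG
    fetch_statistics: Callable[[str], str],                # hypothetical: wraps a Data Commons lookup
    compose_answer: Callable[[str, Dict[str, str]], str],  # hypothetical: wraps a second LLM
) -> str:
    """Generate statistical questions, retrieve data for each, then compose an answer."""
    questions = generate_questions(user_query)

    retrieved: Dict[str, str] = {}
    for question in questions:
        result = fetch_statistics(question)
        if result:  # keep only questions that returned data
            retrieved[question] = result

    return compose_answer(user_query, retrieved)


# Usage with stand-in functions; a real setup would call the model and a data service.
demo = answer_with_rag(
    "How is Sunnyvale doing?",
    generate_questions=lambda q: ["What is the population of Sunnyvale?"],
    fetch_statistics=lambda q: "placeholder table",
    compose_answer=lambda q, data: f"{q} ({len(data)} source(s) used)",
)
print(demo)
```

Passing the three steps in as callables keeps the sketch independent of any particular client library, which matters here because the retrieval and answer-writing components are assumptions.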
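Inputs and outputs
- Input: a text string containing a user query, wrapped in a prompt that asks for statistical questions.
- Output: a list of natural language queries that can be used to answer the user query and that Data Commons' existing natural language interface can understand.
Below is an example prompt for obtaining statistical questions for a user query [User Query]: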
Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: [User Query]
Statistical Questions:
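The template above can be filled in programmatically. The following is a minimal sketch, assuming the `[User Query]` marker is the only part that changes between requests; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced for illustration.

```python
# The prompt template from this card, with "[User Query]" left as the placeholder.
PROMPT_TEMPLATE = """Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: [User Query]
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert the user's query into the template, rejecting empty input."""
    query = user_query.strip()
    if not query:
        raise ValueError("user_query must not be empty")
    # Plain string replacement leaves the $METRIC / $PLACE markers untouched.
    return PROMPT_TEMPLATE.replace("[User Query]", query)


print(build_prompt("What is the unemployment rate in California?"))
```

The resulting string can be passed to the tokenizer in place of the hard-coded `input_text` used in the code examples.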
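Model data
The base model was trained on a text dataset drawn from a wide variety of sources; see the Gemma 2 documentation for details. The DataGemma RAG model was fine-tuned on synthetically generated data. See the DataGemma paper for details.
Implementation information
Like Gemma, DataGemma RAG was trained on TPUv5e using JAX.
Evaluation
The model was evaluated as part of the evaluation of the full RAG workflow, which is documented in the DataGemma paper.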
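Ethics and safety
We are releasing early versions of the models. They are intended for academic and research use only and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
- Before release, we red-teamed the Data Commons natural language interface and checked it against a set of potentially dangerous queries that could lead to misleading, controversial, or inflammatory results.
- We ran the same queries against the outputs of the RIG and RAG models and found a few query responses that were controversial but not dangerous.
- Because this model is intended for academic and research purposes only, it has not gone through our usual safety evaluations.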
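Usage and limitations
These models have certain limitations that users should be aware of.
This is a very early version of DataGemma RAG. It is meant for trusted tester use (primarily academic and research use) and is not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
Your feedback and evaluations are critical to refining DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.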
🔧 Technical Details
Citation
@misc{radhakrishnan2024knowing,
title={Knowing When to Ask - Bridging Large Language Models and Data},
author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
year={2024},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://datacommons.org/link/DataGemmaPaper},
}
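📄 License
License: gemma
Resources and technical documentation
Terms of use
Authors
Access Gemma
To access Gemma on Hugging Face, you need to review and agree to Google's usage license. Make sure you are logged in to Hugging Face before acknowledging the license. Requests are processed immediately.
Model tags
- Conversational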



