DataGemma RAG 27B IT | AIbase Model Library


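Developed by Google

DataGemma is a series of models fine-tuned from Gemma 2 to help large language models access and incorporate reliable public statistical data from Data Commons.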

Large language model

Transformers

#RetrievalAugmentedGeneration #PublicStatisticsQueries #MultiQuestionGeneration

Downloads: 691

Released: 8/26/2024

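Model Overview

DataGemma RAG is built for retrieval-augmented generation: it is trained to take a user query and generate a list of queries that the Data Commons natural language interface can understand.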

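Model Features

Retrieval-augmented generation

Generates queries that the Data Commons natural language interface can understand

Public statistics integration

Designed specifically to access and incorporate reliable public statistical data from Data Commons

Structured question generation

Turns a user query into statistical questions that follow a fixed set of formats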

Model Capabilities

Natural language understanding

Statistical question generation

Data query conversion

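Use Cases

Data analysis

Demographic queries

Generates queries about the demographics of a specific area

For example, structured questions such as 'What is the resident population of Sunnyvale?'

Economic indicator queries

Generates queries about economic indicators such as the unemployment rate

For example, structured questions such as 'What is the unemployment rate in California?'

Research assistance

Social science research

Helps researchers obtain public statistical data quickly

Automatically generates research questions that fit the Data Commons interface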

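🚀 DataGemma RAG Model Card

DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses, supporting answers to statistical questions.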

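🚀 Quick Start

Environment setup

Running the model requires a few dependencies. The install command depends on how you plan to run it:

Running on a single GPU or multiple GPUs: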

pip install -U transformers accelerate

Running with 4-bit quantization (using bitsandbytes):

pip install -U transformers bitsandbytes accelerate
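
As a rough guide to choosing between the two setups, the memory needed just to hold the weights can be estimated from the parameter count and the bits per parameter. The sketch below assumes about 27 billion parameters (read off the "27b" in the model name, not an exact figure) and ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be higher. `estimate_weight_memory_gib` is an illustrative helper, not part of any library.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 27e9 parameters, taken from the "27b" in the model name.
NUM_PARAMS = 27e9

bf16_gib = estimate_weight_memory_gib(NUM_PARAMS, 16)  # torch.bfloat16
nf4_gib = estimate_weight_memory_gib(NUM_PARAMS, 4)    # 4-bit NF4 weights

print(f"bfloat16 weights: ~{bf16_gib:.0f} GiB")
print(f"4-bit weights:    ~{nf4_gib:.0f} GiB")
```

Under these assumptions the bfloat16 weights need about four times the memory of the 4-bit weights, which is why the quantized setup may fit on a single GPU where the bfloat16 one needs several.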

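Code examples

The snippets below run the fine-tuned model. This is only one step of the full RAG approach described in the DataGemma paper. You can try the end-to-end RAG flow in this Colab notebook.

Running on a single GPU or multiple GPUs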

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
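# Send the inputs to the device holding the model's first layers instead of
# hard-coding 'cuda'; this also works when device_map='auto' shards the model.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Slice off the prompt tokens so that only the newly generated questions are decoded.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)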

Running with 4-bit quantization (using bitsandbytes)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
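# Send the inputs to the device holding the model's first layers instead of
# hard-coding 'cuda'; this also works when device_map='auto' shards the model.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Slice off the prompt tokens so that only the newly generated questions are decoded.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)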

Example output

What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
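
The output is plain text with one question per line, so it typically needs light post-processing before each question is passed on. The example above also contains 27 questions even though the prompt asks for at most 25, so capping the list may be worthwhile. The following is a minimal sketch under those assumptions; `parse_questions`, `MAX_QUESTIONS`, and the two loose regular expressions are illustrative choices, not part of any DataGemma or Data Commons API.

```python
import re

MAX_QUESTIONS = 25  # the prompt asks for at most 25 questions

# Loose patterns for the question forms listed in the prompt:
# "What is ... ?" and "How has ... changed over time ... ?"
_QUESTION_PATTERNS = (
    re.compile(r"^What is .+\?$"),
    re.compile(r"^How has .+ changed over time.*\?$"),
)


def parse_questions(raw_output: str, limit: int = MAX_QUESTIONS) -> list[str]:
    """Split model output into unique, well-formed questions, keeping order."""
    questions: list[str] = []
    seen: set[str] = set()
    for line in raw_output.splitlines():
        if len(questions) >= limit:
            break
        question = line.strip()
        if not question or question in seen:
            continue  # skip blank lines and exact duplicates
        if any(pattern.match(question) for pattern in _QUESTION_PATTERNS):
            seen.add(question)
            questions.append(question)
    return questions


sample = """What is the population of Sunnyvale?
What is the population of Sunnyvale?

How has the population of Sunnyvale changed over time?
1. not a valid question"""

print(parse_questions(sample))
```

Lines that do not match either pattern are dropped silently here; a stricter pipeline might log them instead.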

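✨ Key Features

Data integration: DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses.
Retrieval-augmented generation (RAG): DataGemma RAG is trained to take a user query and generate natural language queries that the existing Data Commons natural language interface can understand.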
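
To show where the model sits in a retrieval-augmented flow, here is a minimal sketch. It assumes the remaining steps are: send each generated question to Data Commons, then hand the retrieved statistics and the original query to an answering LLM. The three callables are hypothetical placeholders supplied by the caller; the sketch does not call any real DataGemma or Data Commons API, and the prompt layout is an assumption rather than the format used in the DataGemma paper.

```python
from typing import Callable, Dict, List


def answer_with_statistics(
    user_query: str,
    generate_questions: Callable[[str], List[str]],
    fetch_statistics: Callable[[str], str],
    generate_answer: Callable[[str], str],
) -> str:
    """Run a three-step RAG flow using caller-supplied components."""
    # Step 1: the fine-tuned model turns the user query into statistical questions.
    questions = generate_questions(user_query)

    # Step 2: look up each question; keep only those that returned something.
    retrieved: Dict[str, str] = {}
    for question in questions:
        result = fetch_statistics(question)
        if result:
            retrieved[question] = result

    # Step 3: prepend the retrieved statistics to the original query.
    context = "\n".join(f"{q}\n{stat}" for q, stat in retrieved.items())
    augmented_prompt = f"Statistics:\n{context}\n\nQuery: {user_query}\nAnswer:"
    return generate_answer(augmented_prompt)


# Stubs standing in for the real components (hypothetical, for illustration only).
def _stub_questions(query: str) -> List[str]:
    return ["What is the population of Sunnyvale?"]


def _stub_fetch(question: str) -> str:
    return "<value returned by Data Commons>"  # placeholder, not real data


def _stub_answer(prompt: str) -> str:
    return prompt  # echo the augmented prompt instead of calling an LLM


result = answer_with_statistics(
    "How big is Sunnyvale?", _stub_questions, _stub_fetch, _stub_answer
)
print(result)
```

Passing the components in as callables keeps the flow testable without a GPU or network access; each stub can be swapped for a real implementation independently.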

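📚 Detailed Documentation

Model information

Description

DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RAG is used for retrieval-augmented generation (RAG): it is trained to take a user query and generate natural language queries that the existing Data Commons natural language interface can understand. See the research paper for more information.

Inputs and outputs

Input: a text string containing the user query, with a prompt asking for statistical questions.
Output: a list of natural language queries that can be used to answer the user query and that the existing Data Commons natural language interface can understand.

The following is an example prompt for obtaining statistical questions for a user query [User Query]: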

Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: [User Query]
Statistical Questions:
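
The template can be filled in programmatically so that the fixed instructions stay identical from call to call. The sketch below is one way to do that; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, and the only assumption is that the user query replaces the [User Query] placeholder verbatim.

```python
PROMPT_TEMPLATE = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: {query}
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert the user query into the fixed question-generation prompt."""
    # str.format only substitutes {query}; the $METRIC-style markers are left as-is.
    return PROMPT_TEMPLATE.format(query=user_query.strip())


prompt = build_prompt("  How is the economy of Sunnyvale doing? ")
print(prompt.splitlines()[-2])
```

The resulting string can be passed to the tokenizer in place of a hand-written `input_text`.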

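Model data

The base model was trained on a dataset of text from a variety of sources; see the Gemma 2 documentation for details. The DataGemma RAG model was fine-tuned on synthetically generated data. See the DataGemma paper for details.

Implementation information

Like Gemma, DataGemma RAG was trained on TPUv5e using JAX.

Evaluation

The model was evaluated as part of the evaluation of the full RAG workflow, which is documented in the DataGemma paper.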

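Ethics and safety

We are releasing an early version of the models. They are meant for academic and research use only and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please expect errors and limitations as we actively develop this large language model interface.

Before release, we red-teamed the Data Commons natural language interface and checked it against a set of potentially dangerous queries that could produce misleading, controversial, or inflammatory results.
We ran the same queries against the outputs of the RIG and RAG models and found a few query responses that were controversial, but not dangerous.
Because this model is meant purely for academic and research purposes, it has not been through our usual safety evaluations.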

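Usage and limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RAG. It is meant for trusted testers only (primarily for academic and research use) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please expect errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to improving DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a full picture of DataGemma's current capabilities.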

🔧 Technical Details

Citation

@misc{radhakrishnan2024knowing,
      title={Knowing When to Ask - Bridging Large Language Models and Data}, 
      author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://datacommons.org/link/DataGemmaPaper}, 
}