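🚀 DataGemma RAG Model Card
DataGemma is a series of fine-tuned Gemma 2 models that help large language models (LLMs) access and incorporate reliable public statistical data from Data Commons into their responses, supporting answers to statistical questions.
🚀 Quick start
Environment setup
To run the model, install the required dependencies. The install command depends on how you run it:
- Running on a single GPU or multiple GPUs: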
pip install -U transformers accelerate
- Running with 4-bit quantization (using bitsandbytes):
pip install -U transformers bitsandbytes accelerate
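As a rough guide for choosing between the two setups above, the memory needed for the weights alone can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch, not a measurement: the parameter count is assumed to be a nominal 27e9 (read off the "27b" in the model id), and the estimate ignores activations, the KV cache, and the extra constants that quantization stores.

```python
# Rough estimate of weight memory at different precisions.
# ASSUMPTION: 27e9 parameters, taken from the "27b" in the model id.
# Activations, KV cache, and quantization overhead are NOT included.

def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

NOMINAL_PARAMS = 27e9  # assumption, see above

bf16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # bfloat16: 16 bits per weight
nf4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)    # 4-bit quantization: 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gib:.1f} GiB")
print(f"4-bit weights:    ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the bfloat16 footprint, which is the main reason to consider the bitsandbytes setup on smaller GPUs.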
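Code examples
The snippets below run the fine-tuned model, which is only one step of the full RAG approach described in the DataGemma paper. You can try the end-to-end RAG flow in the project's Colab notebook.
Running on a single GPU or multiple GPUs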
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map='auto',
torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
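# Tokenize the prompt and move it to the device that holds the model's first layer.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the echoed prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)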
Running with 4-bit quantization (using bitsandbytes)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map='auto',
quantization_config=nf4_config,
torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
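# Tokenize the prompt and move it to the device that holds the model's first layer.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the echoed prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)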
Example output
What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
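Because the prompt asks for one question per line with no numbering or bullets, the generated text can be split into a list before it is passed on to a later step. The helper below is a minimal sketch of that post-processing; `parse_questions` is a hypothetical name, and the de-duplication and the cap of 25 (mirroring the limit stated in the prompt) are assumptions rather than part of the DataGemma code.

```python
from typing import List


def parse_questions(answer: str, max_questions: int = 25) -> List[str]:
    """Split the model's raw output into a list of unique, non-empty questions.

    Hypothetical helper: keeps the original order, drops blank lines and
    exact duplicates, and truncates to `max_questions`.
    """
    seen = set()
    questions = []
    for line in answer.splitlines():
        question = line.strip()
        if not question or question in seen:
            continue
        seen.add(question)
        questions.append(question)
    return questions[:max_questions]


# Usage with a fragment of the example output above.
sample_answer = """What is the population of Sunnyvale?
What is the population of Sunnyvale males?

What is the population of Sunnyvale?
How has the population of Sunnyvale changed over time?
"""
parsed = parse_questions(sample_answer)
for question in parsed:
    print(question)
```

An empty model response, which the prompt allows when no statistical questions apply, yields an empty list.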
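✨ Key features
- Data integration: DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses.
- Retrieval-augmented generation (RAG): DataGemma RAG is trained to take a user query and generate natural language queries that the existing Data Commons natural language interface can understand.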
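📚 Documentation
Model information
Description
DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RAG is used for retrieval-augmented generation (RAG): it is trained to take a user query and generate natural language queries that the existing Data Commons natural language interface can understand. See the research paper for more information.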
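To show where this model sits in a retrieval-augmented flow, here is a minimal sketch of the control flow only. The model card describes just the first step (turning a user query into natural language queries for Data Commons); the lookup and final answering steps follow the generic RAG pattern and are assumptions here. All three callables are hypothetical placeholders, not the Data Commons API or the method from the DataGemma paper.

```python
from typing import Callable, Dict, List, Optional


def rag_answer(
    user_query: str,
    generate_questions: Callable[[str], List[str]],
    lookup_statistic: Callable[[str], Optional[str]],
    answer_with_context: Callable[[str, Dict[str, str]], str],
) -> str:
    """Generic RAG control flow with injected (hypothetical) components."""
    # Step 1: the fine-tuned model turns the user query into statistical questions.
    questions = generate_questions(user_query)
    # Step 2 (assumed): each question is sent to a retrieval backend;
    # questions with no result are skipped.
    retrieved: Dict[str, str] = {}
    for question in questions:
        result = lookup_statistic(question)
        if result is not None:
            retrieved[question] = result
    # Step 3 (assumed): a language model answers using the retrieved facts.
    return answer_with_context(user_query, retrieved)


# Stand-ins so the sketch runs without a model or network access.
def fake_generate_questions(query: str) -> List[str]:
    return [
        "What is the population of Sunnyvale?",
        "How has the score on Sunnyvale schools changed over time?",
    ]


def fake_lookup_statistic(question: str) -> Optional[str]:
    canned = {"What is the population of Sunnyvale?": "<value returned by the lookup>"}
    return canned.get(question)


def fake_answer_with_context(query: str, facts: Dict[str, str]) -> str:
    lines = [f"{q} -> {a}" for q, a in facts.items()]
    return f"Query: {query}\nRetrieved facts:\n" + "\n".join(lines)


result = rag_answer(
    "Tell me about Sunnyvale",
    fake_generate_questions,
    fake_lookup_statistic,
    fake_answer_with_context,
)
print(result)
```

Passing the components in as callables keeps the sketch independent of any particular retrieval service or answering model.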
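Inputs and outputs
- Input: A text string containing a user query, wrapped in a prompt that asks for statistical questions.
- Output: A list of natural language queries that can be used to answer the user query and that the existing Data Commons natural language interface can understand.
The following is an example prompt for obtaining statistical questions for a user query [User Query]: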
Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: [User Query]
Statistical Questions:
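The prompt above can be kept as a template and filled in per query. The following is a minimal sketch; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the template wording follows the code examples in this card. The `[User Query]` placeholder is substituted with plain string replacement so the `$METRIC`-style tokens in the prompt are left untouched.

```python
# Prompt template copied from the model card; "[User Query]" is the placeholder.
PROMPT_TEMPLATE = """Your role is that of a Question Generator. Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
education, environment, etc. Examples are unemployment rate and
life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
districts, etc.
Your response should only have questions, one per line, without any numbering
or bullet.
If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.
Query: [User Query]
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert the user's query into the template (hypothetical helper)."""
    query = user_query.strip()
    if not query:
        raise ValueError("user_query must not be empty")
    return PROMPT_TEMPLATE.replace("[User Query]", query)


prompt = build_prompt("How healthy are people in Chennai?")
print(prompt)
```

The resulting string can be passed as `input_text` to the tokenizer in the snippets above.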
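Model data
The base model was trained on a text dataset drawn from a wide variety of sources; see the Gemma 2 documentation for details. The DataGemma RAG model was fine-tuned on synthetically generated data. More details can be found in the DataGemma paper.
Implementation information
Like Gemma, DataGemma RAG was trained on TPUv5e using JAX.
Evaluation
The model was evaluated as part of the evaluation of the full RAG workflow, which is documented in the DataGemma paper.
Ethics and safety
We are releasing early versions of the models. They are intended for academic and research use only and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
- We red-teamed the Data Commons natural language interface before release and checked it against a set of potentially dangerous queries that could produce misleading, controversial, or inflammatory results.
- We ran the same queries against the outputs of the RIG and RAG models and found a few query responses that were controversial but not dangerous.
- Because this model is intended only for academic and research use, it has not been through our usual safety evaluations.
Usage and limitations
These models have certain limitations that users should be aware of.
This is a very early version of DataGemma RAG. It is meant for trusted tester use (primarily academic and research use) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
Your feedback and evaluations are critical to refining DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.
🔧 Technical details
Citation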
@misc{radhakrishnan2024knowing,
title={Knowing When to Ask - Bridging Large Language Models and Data},
author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
year={2024},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://datacommons.org/link/DataGemmaPaper},
}
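📄 License
License: gemma
Resources and technical documentation
Terms of use
Authors
Accessing Gemma
To access Gemma on Hugging Face, you must review and agree to Google's usage license. Make sure you are logged in to Hugging Face, then acknowledge the license on the model page. Requests are processed immediately.
Model tags
- Conversational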



