Datagemma-rag-27b-it Open-source Model - Empowering Large Language Models to Acquire and Integrate Reliable Public Statistical Data

Datagemma Rag 27b It

Developed by Google

DataGemma is a series of models fine-tuned based on Gemma 2, specifically designed to help large language models access and integrate reliable public statistical data in Data Commons.

Large Language Model

Transformers

#Retrieval-Augmented Generation #Public Statistical Query #Multi-Question Generation

Downloads 691

Release Time: 8/26/2024

Model Overview

DataGemma RAG is designed for use in a retrieval-augmented generation workflow. It is trained to take a user query and generate a list of queries that the Data Commons natural language interface can understand.

Model Features

Retrieval-Augmented Generation

Capable of generating queries that can be understood by the Data Commons natural language interface

Public Statistical Integration

Specifically designed to access and integrate reliable public statistical data in Data Commons

Structured Question Generation

Capable of generating statistical questions in a specific format based on user queries

Model Capabilities

Natural Language Understanding

Statistical Question Generation

Data Query Conversion

Use Cases

Data Analysis

Demographic Query

Generate queries about demographic data in specific regions

Generate structured questions such as 'What is the population of Sunnyvale?'

Economic Indicator Query

Generate queries about economic indicators (such as the unemployment rate)

Generate structured questions such as 'What is the unemployment rate in California?'

Research Assistance

Social Science Research

Help researchers quickly obtain public statistical data

Automatically generate research questions that meet the requirements of the Data Commons interface

🚀 DataGemma RAG Model Card

DataGemma RAG is a fine-tuned Gemma 2 model series that helps LLMs incorporate reliable public statistical data from Data Commons into responses.

🚀 Quick Start

To access Gemma on Hugging Face, you need to review and agree to Google’s usage license. Ensure you're logged in to Hugging Face and click the "Acknowledge license" button. Requests are processed immediately.

✨ Features

DataGemma is a series of fine-tuned Gemma 2 models. DataGemma RAG is used with Retrieval Augmented Generation: given a user query, it generates natural-language queries that the Data Commons interface can understand.

📚 Documentation

Resources and Technical Documentation

Terms of Use

Terms

Authors

Google

Model Information

Description

DataGemma is a series of fine-tuned Gemma 2 models used to help LLMs access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RAG is used with Retrieval Augmented Generation, where it is trained to take a user query and generate natural language queries that can be understood by Data Commons' existing natural language interface. More information can be found in this research paper.
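To make the division of labor concrete, the following is a minimal sketch of how a retrieval-augmented loop around DataGemma RAG might be wired together. It is not the authors' implementation: the function name `rag_answer` and its three callable parameters are hypothetical placeholders for the fine-tuned model, a Data Commons lookup, and a downstream LLM, and the way retrieved statistics are joined into a final prompt is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def rag_answer(
    user_query: str,
    generate_questions: Callable[[str], List[str]],  # hypothetical: wraps DataGemma RAG
    fetch_statistics: Callable[[str], str],          # hypothetical: wraps a Data Commons lookup
    answer_with_context: Callable[[str], str],       # hypothetical: wraps a downstream LLM
) -> str:
    """Illustrative outline of a RAG loop; all three callables are assumptions."""
    # Step 1: turn the user query into statistical questions.
    questions = generate_questions(user_query)

    # Step 2: look up each question and keep only those that returned data.
    context: Dict[str, str] = {}
    for question in questions:
        result = fetch_statistics(question)
        if result:
            context[question] = result

    # Step 3: place the retrieved statistics in front of the original query
    # and hand the combined prompt to the answering model.
    blocks = [f"{question}\n{stats}" for question, stats in context.items()]
    final_prompt = "\n\n".join(blocks + [f"Query: {user_query}"])
    return answer_with_context(final_prompt)
```

Passing the three stages in as callables keeps the sketch independent of any particular client library; each stage can be swapped for a stub when testing.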

Inputs and outputs

Input: Text string containing a user query with a prompt to ask for statistical questions.
Output: A list of natural language queries that can be used to answer the user query and can be understood by Data Commons' existing natural language interface.

Here is an example of a prompt used to get statistical questions for the user query [User Query]:

Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: [User Query]
Statistical Questions:
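Since only the `[User Query]` slot changes between calls, the prompt can be kept as a template and filled in per query. The helper below is a small sketch of that idea; the names `PROMPT_TEMPLATE` and `build_prompt` are illustrative and not part of any DataGemma library.

```python
# Prompt text copied from the model card; {query} marks the [User Query] slot.
PROMPT_TEMPLATE = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: {query}
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert a user query into the question-generation prompt (hypothetical helper)."""
    # Strip surrounding whitespace so the query sits cleanly on its own line.
    return PROMPT_TEMPLATE.format(query=user_query.strip())
```

The returned string can be passed to the tokenizer in place of a hand-written `input_text`.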

Model Data

The base model was trained on a dataset of text data that includes a wide variety of sources; see the Gemma 2 documentation for more details. The DataGemma RAG model is fine-tuned on synthetically generated data. More details can be found in the DataGemma paper.

Implementation Information

Like Gemma, DataGemma RAG was trained on TPUv5e, using JAX.

Evaluation

The model was evaluated as part of the full RAG workflow; the results are documented in the DataGemma paper.

Ethics and Safety

We are releasing an early version of the models. They are meant for academic and research purposes and are not ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this LLM interface.

We red-teamed and checked the Data Commons Natural Language interface pre-launch against a set of potentially dangerous queries that could result in misleading, controversial, or inflammatory results.
We ran these same queries against the outputs of the RIG and RAG models, finding a few examples where query responses were controversial, but not dangerous.
As this model is meant purely for academic and research purposes, it has not been subjected to our usual safety evaluations.

Usage and Limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RAG. It is meant for trusted tester use (primarily for academic and research use) and not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory behavior. Please anticipate errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to refining DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.

📦 Installation

Run on a single/multi GPU

First, install the dependencies: pip install -U transformers accelerate

Run in 4-bit via bitsandbytes

First, install the dependencies: pip install -U transformers bitsandbytes accelerate
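As a rough guide to choosing between the two setups, the arithmetic below estimates the memory needed for the weights alone at each precision. The parameter count of 27 billion is an assumption taken from the "27b" in the model name, and the estimate ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    # bits -> bytes -> GiB
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 27e9  # assumption: taken from the "27b" in the model name

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # bfloat16 stores 16 bits per weight
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # NF4 stores about 4 bits per weight

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights:    {nf4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by a factor of four, from roughly 50 GiB to roughly 13 GiB, which is why the 4-bit path may fit on a single large GPU where the bfloat16 path typically needs several.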

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
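# Move inputs to the device holding the model's first parameters; with
# device_map='auto' this is more robust than hard-coding 'cuda'.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, dropping the echoed prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)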

Advanced Usage (4-bit quantization via bitsandbytes)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type='nf4',
   bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
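# Move inputs to the device holding the model's first parameters; with
# device_map='auto' this is more robust than hard-coding 'cuda'.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, dropping the echoed prompt.
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)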

Example output

What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
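Because the model returns plain text with one question per line, downstream code usually needs to split and sanity-check it before querying Data Commons. The sketch below shows one way to do that; `parse_questions` and its loose regular expressions are illustrative assumptions, not an official validator for the three question forms.

```python
import re
from typing import List

# Loose patterns for the question forms named in the prompt. The example
# output phrases places with "of" as well as "in", so the patterns check
# only the opening words and the trailing question mark.
_ALLOWED_FORMS = (
    re.compile(r"^What is .+\?$"),
    re.compile(r"^How has .+ changed over time.*\?$"),
)


def parse_questions(answer: str, limit: int = 25) -> List[str]:
    """Split raw model output into unique, well-formed questions (hypothetical helper)."""
    seen = set()
    questions: List[str] = []
    for line in answer.splitlines():
        question = line.strip()
        # Skip blank lines and exact duplicates.
        if not question or question in seen:
            continue
        # Keep only lines that look like one of the allowed forms.
        if any(pattern.match(question) for pattern in _ALLOWED_FORMS):
            seen.add(question)
            questions.append(question)
    # Enforce the cap stated in the prompt.
    return questions[:limit]
```

The `limit` argument mirrors the "maximum of 25" instruction in the prompt. The example output above contains 27 lines, so the model does not always respect that cap and enforcing it in code may be worthwhile.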

📄 License

Gemma

🔧 Technical Details

Citation

@misc{radhakrishnan2024knowing,
      title={Knowing When to Ask - Bridging Large Language Models and Data}, 
      author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://datacommons.org/link/DataGemmaPaper}, 
}
