Datagemma-rag-27b-it open-source model: helps large language models retrieve and integrate trusted public statistical data

Datagemma Rag 27b It

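Developed by Google

DataGemma is a series of models fine-tuned from Gemma 2, designed specifically to help large language models access and integrate trusted public statistical data from Data Commons.

Large language model

Transformers

#Retrieval-augmented generation #Public statistics queries #Multi-question generation

Downloads: 691

Released: 8/26/2024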

Model overview

DataGemma RAG uses retrieval-augmented generation. It is trained to take a user query and produce a list of queries that the Data Commons natural language interface can understand.

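Model features

Retrieval-augmented generation

Generates queries that the Data Commons natural language interface can understand

Public statistics integration

Designed specifically to access and integrate trusted public statistical data from Data Commons

Structured question generation

Generates statistical questions in a fixed set of forms based on the user query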

Model capabilities

Natural language understanding

Statistical question generation

Data query conversion

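Use cases

Data analysis

Demographic queries

Generates queries about demographic data for a specific region

For example, generates structured questions such as "What is the resident population of Sunnyvale?"

Economic indicator queries

Generates queries about economic indicators such as the unemployment rate

For example, generates structured questions such as "What is the unemployment rate in California?"

Research support

Social science research

Helps researchers obtain public statistical data quickly

Automatically generates research questions that meet the requirements of the Data Commons interface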

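🚀 DataGemma RAG model card

DataGemma RAG is a model that helps LLMs retrieve trusted public statistical data from Data Commons and incorporate it into their responses. Given a user query, it generates natural language queries that Data Commons' existing natural language interface can understand.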

🔗 Resources and technical documentation

📄 Terms of use

Terms of use

✍️ Author

Google

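✨ Key features

Description

DataGemma is a series of fine-tuned Gemma 2 models that help LLMs retrieve trusted public statistical data from Data Commons and incorporate it into their responses. DataGemma RAG is used with Retrieval Augmented Generation: it is trained to take a user query and generate natural language queries that Data Commons' existing natural language interface can understand. For details, see the research paper cited at the end of this page.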
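The flow described above can be sketched as three pluggable steps. This is a minimal illustration, not the paper's implementation: `rag_answer` and its three callables are hypothetical names, the retrieval step stands in for a call to Data Commons' natural language interface, and the final step assumes that a separate LLM writes the response from the retrieved data.

```python
from typing import Callable, Dict, List


def rag_answer(
    user_query: str,
    generate_questions: Callable[[str], List[str]],
    fetch_statistic: Callable[[str], str],
    answer_with_context: Callable[[str, Dict[str, str]], str],
) -> str:
    """Run a retrieval-augmented flow built from caller-supplied steps."""
    # Step 1: the fine-tuned model turns the user query into statistical questions.
    questions = generate_questions(user_query)

    # Step 2: each question is sent to the data source; empty results are dropped.
    context: Dict[str, str] = {}
    for question in questions:
        result = fetch_statistic(question)
        if result:
            context[question] = result

    # Step 3 (assumed): another LLM composes the final answer from the retrieved data.
    return answer_with_context(user_query, context)
```

Passing the steps in as callables keeps the sketch independent of any particular client library, so each stage can be replaced by a stub when testing.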

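Inputs and outputs

Input: a text string containing a user query together with a prompt asking for statistical questions.
Output: a list of natural language queries that can be used to answer the user query and that Data Commons' existing natural language interface can understand.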

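Below is an example of the prompt used to obtain statistical questions for a user query [User Query].

Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: [User Query]
Statistical Questions: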
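The template above can be filled in programmatically. The following is a minimal sketch that reuses the English prompt text from the model card's code snippet; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names and not part of any library.

```python
PROMPT_TEMPLATE = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: [User Query]
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert a user query into the question-generation prompt."""
    # Collapse newlines and repeated spaces so the query stays on one line.
    query = " ".join(user_query.split())
    if not query:
        raise ValueError("user_query must not be empty")
    return PROMPT_TEMPLATE.replace("[User Query]", query)
```

The result of `build_prompt(...)` can be passed as `input_text` to the tokenizer in the usage examples.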

💻 Usage examples

Basic usage

Below is a code snippet for running the fine-tuned model. This is only one step of the full RAG approach described in the DataGemma paper. You can try the end-to-end RAG flow in the accompanying Colab notebook.

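from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Hugging Face model ID of the fine-tuned DataGemma RAG checkpoint.
model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map='auto' places the weights on the available devices;
# bfloat16 halves the memory footprint compared with float32.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)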

input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
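# Move the tokenized prompt to the device that holds the model's first layers.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is sliced off).
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)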

Example output

What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
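Before the generated text is sent on to Data Commons, it usually needs to be split into individual questions. The sketch below is one possible post-processing step, not part of the model card: `parse_questions` is a hypothetical helper that drops blank and duplicate lines and keeps only lines matching the two sentence shapes the prompt allows ("What is ...?" and "How has ... changed over time ...?"). The sample above lists 27 questions although the prompt asks for at most 25, so an optional `limit` is included.

```python
import re
from typing import List, Optional

# Sentence shapes permitted by the prompt.
_WHAT_IS = re.compile(r"^What is .+\?$")
_HOW_HAS = re.compile(r"^How has .+ changed over time.*\?$")


def parse_questions(answer: str, limit: Optional[int] = None) -> List[str]:
    """Split model output into unique, well-formed statistical questions."""
    seen = set()
    questions: List[str] = []
    for line in answer.splitlines():
        question = line.strip()
        # Skip blank lines and exact repeats.
        if not question or question in seen:
            continue
        # Keep only lines that follow one of the allowed forms.
        if _WHAT_IS.match(question) or _HOW_HAS.match(question):
            seen.add(question)
            questions.append(question)
    # Optionally cap the list, preserving the model's original order.
    return questions if limit is None else questions[:limit]
```

For instance, `parse_questions(answer, limit=25)` would enforce the cap stated in the prompt.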

Advanced usage

To run the model in 4-bit, first run pip install -U transformers bitsandbytes accelerate, then copy the code snippet below.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 4-bit quantization of the weights, with computation in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
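# Move the tokenized prompt to the device that holds the model's first layers.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is sliced off).
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)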
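A back-of-the-envelope calculation shows why 4-bit loading may matter for a model of this size. The sketch below counts the weights only and assumes a nominal 27 billion parameters (read off the "27b" in the model name). It ignores activations, the KV cache, and quantization overhead, so real usage will be higher. `weight_memory_gib` is an illustrative helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 27e9  # assumption: nominal parameter count from the model name

bf16 = weight_memory_gib(NUM_PARAMS, 16)  # bfloat16: 2 bytes per parameter
nf4 = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit: half a byte per parameter

print(f"bfloat16 weights: ~{bf16:.1f} GiB")
print(f"4-bit weights:    ~{nf4:.1f} GiB")
```

Under these assumptions the weights alone take roughly 50 GiB in bfloat16 and about a quarter of that in 4-bit.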

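📚 Documentation

Model data

The base model was trained on a dataset of text data drawn from a wide variety of sources; see the Gemma 2 documentation for details. The DataGemma RAG model was fine-tuned on synthetically generated data. Details are in the DataGemma paper.

Implementation information

Like Gemma, DataGemma RAG was trained on TPUv5e using JAX.

Evaluation

The model was evaluated as part of the evaluation of the full RAG workflow, as documented in the DataGemma paper.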

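Ethics and safety

This is an early version of the model. It is intended for academic and research use and is not yet suitable for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Expect errors and limitations while this LLM interface is under active development.

Before release, we red-teamed the Data Commons natural language interface with a set of potentially dangerous queries, checking for misleading, controversial, or inflammatory results.
We ran the same queries against the outputs of the RIG and RAG models and found a few examples where the query responses were controversial but not dangerous.
Because this model is intended purely for academic and research use, it has not undergone our usual safety evaluations.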

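Usage and limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RAG. It is intended for use by trusted testers (primarily for academic and research purposes) and is not yet suitable for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Expect errors and limitations while this large language model interface is under active development.

Your feedback and evaluations are important for improving DataGemma's performance and will contribute directly to the training process. Known limitations are described in detail in the DataGemma paper, and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.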

Citation

@misc{radhakrishnan2024knowing,
      title={Knowing When to Ask - Bridging Large Language Models and Data}, 
      author={Prashanth Radhakrishnan and Jennifer Chen and Bo Xu and Prem Ramaswami and Hannah Pho and Adriana Olmos and James Manyika and R. V. Guha},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://datacommons.org/link/DataGemmaPaper}, 
}