text2cypher-gemma-2-9b-it-finetuned-2024v1 open-source model: converts natural language to the Neo4j query language, free of charge

Text2cypher Gemma 2 9b It Finetuned 2024v1

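Developed by DavidLanz

A Text2Cypher model fine-tuned from Gemma-2-9b-it that converts natural language into Cypher, the query language of the Neo4j graph database

Knowledge graph, English #natural-language-to-Cypher #graph-database-queries #Neo4j-specific

Downloads: 70

Release date: 11/27/2024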

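Model Overview

This model was fine-tuned on the Neo4j-Text2Cypher dataset specifically to improve the conversion of natural-language questions into Cypher queries, which makes it suited to graph database interaction scenarios.

Model Features

Graph database query conversion

Automatically converts natural-language questions into Cypher queries for the Neo4j graph database

LoRA fine-tuning

Efficient fine-tuning using LoRA (Low-Rank Adaptation)

4-bit quantization

Supports 4-bit quantized inference to reduce resource consumption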
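
As a rough illustration of why 4-bit inference reduces resource consumption, the sketch below estimates weight-only memory at several precisions. The figure of 9 billion parameters is an assumption read off the "9b" in the model name, and the estimate ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
# Rough weight-only memory estimate at different precisions.
# ASSUMPTION: ~9e9 parameters, inferred from the "9b" in the model name.
# Activations, KV cache, and quantization metadata are ignored.

BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    num_params = 9e9  # assumed, see note above
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(num_params, precision)
        print(f"{precision:>8}: ~{gb:.1f} GB")
```

Under these assumptions, 4-bit weights need about one eighth of the memory of float32 weights.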

Model Capabilities

Natural language to Cypher conversion

Graph database query generation

Conversational query handling

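Use Cases

Graph database interaction

Movie knowledge graph queries

Converts natural-language questions such as 'Which movies did Tom Hanks appear in?' into Cypher queries

Generates the correct query: MATCH (a:Actor)-[:ActedIn]->(m:Movie) WHERE a.name = 'Tom Hanks' RETURN m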
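
One way to sanity-check a generated query like the one above is to confirm that it only uses relationship types that appear in the schema. The following is a minimal sketch based on a regular expression. The helper names are hypothetical, and the pattern only handles simple forms such as `[:TYPE]`, `[r:TYPE]`, and `[:A|B]`; it is not a Cypher parser.

```python
import re
from typing import Set

# Captures the type part of a relationship pattern such as [:ActedIn]
# or [r:ActedIn|Directed]. Simplified on purpose: not full Cypher syntax.
_REL_PATTERN = re.compile(r"\[\s*\w*\s*:([^\]\s{*]+)")


def relationship_types(cypher: str) -> Set[str]:
    """Collect relationship type names found in a Cypher pattern string."""
    types: Set[str] = set()
    for match in _REL_PATTERN.finditer(cypher):
        # Alternatives are separated by "|", optionally written as "|:".
        for name in match.group(1).split("|"):
            types.add(name.lstrip(":"))
    return types


def uses_only_schema_relationships(query: str, schema: str) -> bool:
    """True if every relationship type in the query appears in the schema."""
    return relationship_types(query) <= relationship_types(schema)


schema = "(:Actor)-[:ActedIn]->(:Movie)"
good_query = (
    "MATCH (a:Actor)-[:ActedIn]->(m:Movie) "
    "WHERE a.name = 'Tom Hanks' RETURN m"
)
# Hypothetical bad output: "Directed" is not in the schema.
bad_query = "MATCH (a:Actor)-[r:Directed]->(m:Movie) RETURN m"

print(uses_only_schema_relationships(good_query, schema))
print(uses_only_schema_relationships(bad_query, schema))
```

A check of this kind only catches relationship types the schema does not define; it says nothing about whether the query answers the question.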

Data analysis

Relationship network analysis

Automatically generates queries over complex relationship networks

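🚀 Model Card for Model ID

This model serves as a demonstration of how fine-tuning a foundation model on the Neo4j-Text2Cypher (2024) dataset (link) can improve performance on the Text2Cypher task. Note: this is part of ongoing research and exploration intended to highlight the dataset's potential; it is not a production-ready solution.

✨ Key Features

This model was fine-tuned from a foundation model on the Neo4j-Text2Cypher (2024) dataset with the aim of improving Text2Cypher performance. It was developed for research use, to explore the dataset's potential.

📦 Installation

Framework versions

PEFT 0.12.0

💻 Usage Examples

Basic usage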

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "DavidLanz/text2cypher-gemma-2-9b-it-finetuned-2024v1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)"

instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question}  \n Cypher output: "
)
prompt = instruction.format(schema=schema, question=question)
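# Send the inputs to the device that device_map="auto" selected for the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    print("Generated Cypher Query:", generated_text)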

def prepare_chat_prompt(question, schema):
    chat = [
        {
            "role": "user",
            "content": instruction.format(
                schema=schema, question=question
            ),
        }
    ]
    return chat

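def _postprocess_output_cypher(output_cypher: str) -> str:
    # Drop any explanation the model appends after the query.
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    # Strip Markdown code-fence backticks and surrounding whitespace.
    output_cypher = output_cypher.strip("`\n ")
    # Remove the "cypher" language tag as a whole prefix. lstrip("cypher\n")
    # strips a character set and could eat the first letters of the query.
    if output_cypher.startswith("cypher\n"):
        output_cypher = output_cypher[len("cypher\n"):]
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher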

new_message = prepare_chat_prompt(question=question, schema=schema)
try:
    prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
    # The Gemma chat template already emits the BOS token, so do not add it again.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
        # Decode only the newly generated tokens, not the echoed prompt.
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        chat_generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        final_cypher = _postprocess_output_cypher(chat_generated_text)
        print("Processed Cypher Query:", final_cypher)
except AttributeError:
    print("Error: `apply_chat_template` not supported by this tokenizer. Check compatibility.")
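
To illustrate the post-processing step without loading the model, here is a standalone sketch of the fence-stripping logic. The sample response is invented for illustration and `extract_cypher` is a hypothetical name. It removes the `cypher` language tag with `str.removeprefix` (Python 3.9+), because `str.lstrip("cypher\n")` strips a set of characters rather than a prefix and can eat the leading letters of a query.

```python
def extract_cypher(raw_output: str) -> str:
    """Strip explanation text and Markdown fences from a model response."""
    # Keep only the text before any explanation section.
    text, _, _ = raw_output.partition("**Explanation:**")
    # Remove surrounding backticks and whitespace (a ``` fence).
    text = text.strip("`\n ")
    # Remove the language tag as a whole prefix, not as a character set.
    text = text.removeprefix("cypher\n")
    return text.strip("`\n ")


# Invented sample response, for illustration only.
sample = (
    "```cypher\n"
    "call db.labels()\n"
    "```\n"
    "**Explanation:** lists all labels."
)

print(extract_cypher(sample))
# For contrast, a character-set strip damages the same query:
print("call db.labels()".lstrip("cypher\n"))
```

The second print shows the query losing its first letter, which is the failure mode the prefix-based removal avoids.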

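📚 Documentation

Model Details

Model Description

Base model: google/gemma-2-9b-it
Dataset: neo4j/text2cypher-2024v1

An overview of the fine-tuned models and their benchmark results is shared in Link 1 and Link 2.

If you have ideas or insights, please contact Neo4j/Team-GenAI.

Training Details

Training Procedure

RunPod was used with the following setup: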

1 x A100 PCIe
31 vCPU 117 GB RAM
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
On-Demand - Secure Cloud
60 GB Disk
60 GB Pod Volume

Training Hyperparameters

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# target_modules and dataset_text_field are defined elsewhere;
# the source does not show their values.

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
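
The hyperparameters above imply a few derived quantities that can be worked out directly. The sketch below computes the effective batch size per optimizer step, the LoRA scaling factor `lora_alpha / r`, and the number of trainable parameters LoRA adds to a single weight matrix. The GPU count comes from the "1 x A100 PCIe" setup listed earlier. The 4096 x 4096 layer shape is a hypothetical example, not a dimension taken from the model.

```python
# Values copied from the configuration in this section.
PER_DEVICE_TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 8
NUM_GPUS = 1  # "1 x A100 PCIe" in the training setup
LORA_R = 64
LORA_ALPHA = 64


def effective_batch_size(per_device: int, accumulation: int, gpus: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * accumulation * gpus


def lora_scaling(alpha: int, r: int) -> float:
    """Standard LoRA scales the low-rank update by alpha / r."""
    return alpha / r


def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(
        PER_DEVICE_TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, NUM_GPUS))
    print("LoRA scaling:", lora_scaling(LORA_ALPHA, LORA_R))

    # HYPOTHETICAL layer shape, chosen only to make the arithmetic concrete.
    d_out, d_in = 4096, 4096
    added = lora_trainable_params(d_out, d_in, LORA_R)
    full = d_out * d_in
    print(f"LoRA params: {added} vs full weight: {full} ({added / full:.3%})")
```

With these settings each optimizer step sees 32 samples, and because `lora_alpha` equals `r` the low-rank update is applied with a scaling factor of 1.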

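🔧 Technical Details

Bias, Risks, and Limitations

A few risks deserve attention.

In the evaluation setup, the training and test sets come from the same data distribution (both sampled from a larger dataset). If the data distribution shifts, results may not follow the same pattern.
The datasets used were collected from publicly available sources. Over time, foundation models may gain access to both the training and test sets and achieve similar or even better results.

See also the related blog post: link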