Text2Cypher-gemma-2-9b-it-finetuned-2024v1 open-source model: convert natural language to Cypher queries for free

Text2cypher Gemma 2 9b It Finetuned 2024v1

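Developed by neo4j

This is a Text2Cypher model fine-tuned from google/gemma-2-9b-it. It converts natural-language questions into Cypher queries for the Neo4j graph database.

Knowledge graph

Safetensors

English | License: Apache-2.0 | #Natural language to Cypher #Neo4j graph database interaction #LoRA efficient fine-tuning

Downloads: 2,093

Released: 9/10/2024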

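Model Overview

This model demonstrates how fine-tuning a base model on the Neo4j-Text2Cypher (2024) dataset improves performance on the Text2Cypher task. Its main use is converting natural-language questions into Cypher queries.

Model Features

Efficient natural-language-to-Cypher conversion

Translates natural-language questions into valid Cypher queries.

LoRA fine-tuning

The model is adapted with LoRA, a parameter-efficient fine-tuning technique, which improves task-specific performance while preserving the base model's capabilities.

4-bit quantization support

Supports 4-bit quantized inference, which lowers hardware requirements.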
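
To give a feel for what 4-bit inference saves, the sketch below estimates the memory needed for the weights alone. It is back-of-the-envelope arithmetic under stated assumptions: the parameter count is taken as a nominal 9 billion (from the "9b" in the model name), and the estimate ignores activations, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


# Assumption: nominal parameter count implied by the "9b" in the model name.
NUM_PARAMS = 9e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"bf16 weights : ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
print(f"reduction    : {bf16_gib / nf4_gib:.0f}x")
```

Actual usage will be higher than these figures, so treat them as a lower bound when choosing a GPU.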

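Model Capabilities

Natural language understanding

Cypher query generation

Graph database interaction

Use Cases

Graph database querying

Actor-movie lookup

Find all movies a given actor appeared in

Generates the matching query: MATCH (a:Actor)-[:ActedIn]->(m:Movie) RETURN m

Complex relationship queries

Find relationship paths that satisfy given conditions

Generates multi-hop queries based on the schema

Data analysis

Graph statistics

Generate queries that summarize characteristics of the graph data

Generates queries that use aggregation functions such as COUNT and SUM

🚀 Text-to-Cypher Generation Model

This model demonstrates how to fine-tune a base model on the Neo4j-Text2Cypher (2024) dataset to improve performance on the text-to-Cypher task. It is part of ongoing research and exploration, and it is meant to highlight the dataset's potential rather than to serve as a production-ready solution.

🚀 Quick Start

Use the following code sample to get started with the model: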

from peft import PeftModel, PeftConfig
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)


instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question}  \n Cypher output: "
)


def prepare_chat_prompt(question, schema) -> list[dict]:
    chat = [
        {
            "role": "user",
            "content": instruction.format(
                schema=schema, question=question
            ),
        }
    ]
    return chat

def _postprocess_output_cypher(output_cypher: str) -> str:
    # Remove any explanation. E.g.  MATCH...\n\n**Explanation:**\n\n -> MATCH...
    # Remove cypher indicator. E.g. ```cypher\nMATCH...``` --> MATCH...
    # Note: Possible to have both:
    #   E.g. ```cypher\nMATCH...```\n\n**Explanation:**\n\n --> MATCH...
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    output_cypher = output_cypher.strip("`\n")
    # removeprefix (Python 3.9+) drops only the literal "cypher" tag.
    # lstrip("cypher\n") strips a character set, so it would also eat leading
    # query characters, e.g. the "re" of a lowercase "return".
    output_cypher = output_cypher.removeprefix("cypher")
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher

# Model
model_name = "neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
)

# Question
question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)" # Check the NOTE below on creating your own schemas
new_message = prepare_chat_prompt(question=question, schema=schema)
prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt", padding=True)

# Any other parameters
model_generate_parameters = {
    "top_p": 0.9,
    "temperature": 0.2,
    "max_new_tokens": 512,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

inputs.to(model.device)
model.eval()
with torch.no_grad():
    tokens = model.generate(**inputs, **model_generate_parameters)
    tokens = tokens[:, inputs.input_ids.shape[1] :]
    raw_outputs = tokenizer.batch_decode(tokens, skip_special_tokens=True)
    outputs = [_postprocess_output_cypher(output) for output in raw_outputs]
    
print(outputs)
# Example output (sampling is enabled, so results may vary):
# ["MATCH (a:Actor {Name: 'Tom Hanks'})-[:ActedIn]->(m:Movie) RETURN m"]

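The post-processing step can be exercised on its own, without loading the model. The sketch below is a self-contained variant of the quick-start helper that uses `str.removeprefix` (Python 3.9+) to drop the language tag. The function name `postprocess_output_cypher` and the sample raw strings are made up for illustration; they are not real model outputs.

```python
def postprocess_output_cypher(raw: str) -> str:
    """Strip a trailing explanation and Markdown code-fence markers."""
    # Keep only the text before an explanation section, if there is one.
    cypher, _, _ = raw.partition("**Explanation:**")
    # Drop surrounding backticks and newlines, then the "cypher" language tag.
    cypher = cypher.strip("`\n")
    cypher = cypher.removeprefix("cypher")
    return cypher.strip("`\n ")


FENCE = "`" * 3  # a Markdown code fence, built here to keep the sample readable
QUERY = "MATCH (m:Movie) RETURN m"

# Hypothetical raw outputs covering the cases listed in the helper's comments.
samples = [
    QUERY,
    f"{FENCE}cypher\n{QUERY}\n{FENCE}",
    f"{FENCE}cypher\n{QUERY}\n{FENCE}\n\n**Explanation:**\n\nReturns all movies.",
]

cleaned = [postprocess_output_cypher(s) for s in samples]
for query in cleaned:
    print(query)
```

All three samples reduce to the same bare query, which is the behavior the comments in the helper describe.
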
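✨ Key Features

This model demonstrates an approach to fine-tuning a base model on the Neo4j-Text2Cypher (2024) dataset to improve performance on the text-to-Cypher task.
It is part of ongoing research and exploration aimed at highlighting the dataset's potential.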

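📚 Documentation

Model Details

This model demonstrates how to fine-tune a base model on the Neo4j-Text2Cypher (2024) dataset to improve performance on the text-to-Cypher task. Note that this is part of ongoing research and exploration; it is meant to highlight the dataset's potential rather than to serve as a production-ready solution.

Base model: google/gemma-2-9b-it
Dataset: neo4j/text2cypher-2024v1

An overview of the fine-tuned models and their benchmark results is available at Link1 and Link2.

If you have ideas or insights, contact us: Neo4j/Team-GenAI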

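Bias, Risks, and Limitations

Keep the following risks in mind:

- In our evaluation setup, the training and test sets come from the same data distribution (both were sampled from a larger dataset). If the data distribution changes, results may not follow the same pattern.
- The datasets were collected from publicly available sources. Over time, foundation models may gain access to both the training and test sets, and so may achieve similar or even better results.

See also the related blog post: Link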

Training Details

Training Procedure

Training ran on RunPod with the following setup:

- 1 x A100 PCIe
- 31 vCPU, 117 GB RAM
- runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
- On-Demand - Secure Cloud
- 60 GB Disk
- 60 GB Pod Volume

Training Hyperparameters

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
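
A few quantities follow directly from these settings, and the sketch below works them out. The batch size, accumulation steps, rank, and alpha come from the configuration above, and the single GPU comes from the "1 x A100 PCIe" setup. The 3584 x 3584 layer shape is an illustrative assumption, not a figure from this model card, and the scaling factor assumes the standard LoRA convention of alpha / r.

```python
# Values taken from the training configuration above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1  # "1 x A100 PCIe"
max_seq_length = 1600
lora_r = 64
lora_alpha = 64

# Sequences that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
# Upper bound on tokens per optimizer step (sequences are at most max_seq_length).
max_tokens_per_step = effective_batch_size * max_seq_length

# Standard LoRA scales the low-rank update by alpha / r (assumed convention).
lora_scaling = lora_alpha / lora_r


def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    The frozen weight W stays untouched; LoRA learns B (d_out x r) and A (r x d_in).
    """
    return r * (d_in + d_out)


# Assumption: a hypothetical square projection layer, for illustration only.
D_IN = D_OUT = 3584
adapter_params = lora_params_per_layer(D_IN, D_OUT, lora_r)
frozen_params = D_IN * D_OUT

print(f"effective batch size : {effective_batch_size}")
print(f"max tokens per step  : {max_tokens_per_step:,}")
print(f"LoRA scaling factor  : {lora_scaling}")
print(f"adapter / frozen     : {adapter_params:,} / {frozen_params:,} "
      f"({adapter_params / frozen_params:.2%})")
```

With alpha equal to r, the low-rank update is applied unscaled, and the adapter for a layer of this assumed shape trains only a few percent of the parameters a full fine-tune of that layer would.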

Framework Versions

PEFT 0.12.0

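Note on Creating Your Own Schemas

In the dataset we used, the schemas are already provided. They were created in one of two ways:
- by directly using the schema provided with the input data source;
- by creating the schema with the neo4j-graphrag package (see the SchemaReader.get_schema(...) function).
For your own Neo4j database, you can use the SchemaReader functions from the neo4j-graphrag package.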
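
For quick experiments without a database connection, a schema string in the same pattern style as the quick-start example can also be written by hand. The sketch below is one way to do that. The `format_schema` helper and the `Director`/`Directed` entries are hypothetical, it is not part of neo4j-graphrag, and the schemas shipped with the dataset may use a richer format than this.

```python
def format_schema(relationships: list[tuple[str, str, str]]) -> str:
    """Render (start_label, relationship_type, end_label) triples as Cypher patterns.

    Hypothetical helper for illustration; one pattern per line.
    """
    patterns = [
        f"(:{start})-[:{rel_type}]->(:{end})"
        for start, rel_type, end in relationships
    ]
    return "\n".join(patterns)


# Made-up example graph: the first triple matches the quick-start schema.
schema = format_schema(
    [
        ("Actor", "ActedIn", "Movie"),
        ("Director", "Directed", "Movie"),
    ]
)
print(schema)
```

The resulting string can be passed as the `schema` argument of `prepare_chat_prompt` in the quick-start code.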