text2cypher-gemma-2-9b-it-finetuned-2024v1 Open Source Model - Free Conversion of Natural Language to Neo4j Query Language

Text2cypher Gemma 2 9b It Finetuned 2024v1

Developed by DavidLanz

A Text2Cypher model fine-tuned on Gemma-2-9b-it for converting natural language to Neo4j graph database query language Cypher

Knowledge Graph English #Natural Language to Cypher #Graph Database Query #Neo4j Specialized

Downloads 70

Release Time: 11/27/2024

Model Overview

This model is fine-tuned on the Neo4j-Text2Cypher dataset to improve the conversion of natural language questions into Cypher queries, which makes it suited to graph database interaction scenarios.

Model Features

Graph Database Query Conversion

Automatically converts natural language questions into Cypher query statements for Neo4j graph databases

LoRA Fine-Tuning

Utilizes LoRA (Low-Rank Adaptation) technology for efficient fine-tuning

4-bit Quantization

Supports 4-bit quantized inference to reduce resource consumption
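
To put the resource saving in rough numbers, the sketch below estimates weights-only memory at different precisions. It assumes a nominal 9 billion parameters (read off the "9b" in the model name, not an exact count) and ignores activations, the KV cache, quantization metadata, and any layers kept in higher precision, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count taken from the "9b" in the model name.
NOMINAL_PARAMS = 9e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the estimate is about 36 GB for float32, 18 GB for bfloat16, and 4.5 GB for 4-bit weights.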

Model Capabilities

Natural Language to Cypher Conversion

Graph Database Query Generation

Conversational Query Processing

Use Cases

Graph Database Interaction

Movie Knowledge Graph Query

Converts natural language questions like 'Which movies did Tom Hanks star in?' into Cypher queries

For this question, the expected output is a query such as MATCH (a:Actor)-[:ActedIn]->(m:Movie) WHERE a.name = 'Tom Hanks' RETURN m

Data Analysis

Relationship Network Analysis

Generates query statements for complex relationship networks

🚀 Model Card for text2cypher-gemma-2-9b-it-finetuned-2024v1

This model demonstrates how fine-tuning foundational models with the Neo4j-Text2Cypher (2024) Dataset can improve performance on the Text2Cypher task. It is part of ongoing research and is meant to highlight the dataset's potential rather than to serve as a production-ready solution.

✨ Features

Demonstrates fine-tuning of foundational models for the Text2Cypher task.
Utilizes the Neo4j-Text2Cypher (2024) Dataset.
Benchmarking results and overviews of the fine-tuned models are shared via external links.

💻 Usage Examples

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "DavidLanz/text2cypher-gemma-2-9b-it-finetuned-2024v1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)"

instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question}  \n Cypher output: "
)
prompt = instruction.format(schema=schema, question=question)
# Move inputs to the device chosen by device_map="auto" instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Generated Cypher Query:", generated_text)
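
For decoder-only models, the sequence returned by `generate` starts with the prompt tokens, so the decoded text printed above includes the instruction as well as the query. A minimal sketch of one way to keep only the completion is shown below; the helper name `extract_completion` and the sample string are hypothetical and are not real model output.

```python
def extract_completion(generated_text: str, marker: str = "Cypher output:") -> str:
    """Return the text after the last occurrence of `marker`, stripped.

    Falls back to the whole (stripped) text when the marker is absent.
    """
    _, found, completion = generated_text.rpartition(marker)
    return completion.strip() if found else generated_text.strip()


# Hypothetical decoded text: the prompt followed by a made-up completion.
sample = (
    "Generate Cypher statement to query a graph database. "
    "Schema: (:Actor)-[:ActedIn]->(:Movie) \n"
    " Question: What are the movies of Tom Hanks?  \n"
    " Cypher output: MATCH (a:Actor)-[:ActedIn]->(m:Movie) "
    "WHERE a.name = 'Tom Hanks' RETURN m"
)
print(extract_completion(sample))
```

An alternative is to slice off the prompt tokens before decoding, for example `outputs[0][inputs["input_ids"].shape[1]:]`, which avoids depending on the marker text.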

Advanced Usage

def prepare_chat_prompt(question, schema):
    chat = [
        {
            "role": "user",
            "content": instruction.format(
                schema=schema, question=question
            ),
        }
    ]
    return chat

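def _postprocess_output_cypher(output_cypher: str) -> str:
    # Drop any explanation the model appends after the query
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    # Strip Markdown code-fence backticks and surrounding newlines
    output_cypher = output_cypher.strip("`\n")
    # Remove a leading "cypher" language tag. removeprefix (Python 3.9+) matches the
    # exact prefix; lstrip("cypher\n") would also strip leading letters of the query.
    output_cypher = output_cypher.removeprefix("cypher\n")
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher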

# Reuses model, tokenizer, instruction, question and schema from Basic Usage
new_message = prepare_chat_prompt(question=question, schema=schema)
try:
    prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
        # outputs[0] begins with the prompt tokens; decode only the newly generated ones
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        chat_generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        final_cypher = _postprocess_output_cypher(chat_generated_text)
        print("Processed Cypher Query:", final_cypher)
except AttributeError:
    print("Error: `apply_chat_template` not supported by this tokenizer. Check compatibility.")
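
The post-processing step can be tried without loading the model. The sketch below is a standalone variant of that function applied to a hypothetical raw output (a made-up string, not real model output). It uses `str.removeprefix` (Python 3.9+) for the language tag, because `str.lstrip` treats its argument as a set of characters rather than a prefix.

```python
def postprocess_output_cypher(output_cypher: str) -> str:
    # Drop any explanation appended after the query
    output_cypher, _, _ = output_cypher.partition("**Explanation:**")
    # Strip code-fence backticks and surrounding newlines
    output_cypher = output_cypher.strip("`\n")
    # Remove a leading "cypher" language tag (exact prefix match)
    output_cypher = output_cypher.removeprefix("cypher\n")
    return output_cypher.strip("`\n ")


FENCE = "`" * 3  # three backticks, built here to keep this example readable

# Hypothetical raw output: fenced query followed by an explanation.
raw = (
    f"{FENCE}cypher\n"
    "MATCH (a:Actor)-[:ActedIn]->(m:Movie) WHERE a.name = 'Tom Hanks' RETURN m\n"
    f"{FENCE}\n"
    "**Explanation:** This query finds the movies Tom Hanks acted in."
)
cleaned = postprocess_output_cypher(raw)
print(cleaned)

# Pitfall: lstrip removes any leading characters found in the given set,
# so a query that starts with lowercase letters from "cypher" gets truncated.
truncated = "return 1".lstrip("cypher\n")
print(repr(truncated))
```

The last two lines show why the character-set behaviour matters: a lowercase `return 1` loses its first two letters under `lstrip`, while the prefix-based version leaves it intact.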

📚 Documentation

Model Details

Model Description

This model shows how fine-tuning foundational models using the Neo4j-Text2Cypher (2024) Dataset (link) can improve performance on the Text2Cypher task. Note that this is part of ongoing research, which aims to highlight the dataset's potential rather than to offer a production-ready solution.

Base model: google/gemma-2-9b-it
Dataset: neo4j/text2cypher-2024v1

An overview of the finetuned models and benchmarking results is shared at Link1 and Link2

Have ideas or insights? Contact us: [Neo4j/Team-GenAI](mailto:team-gen-ai@neo4j.com)

Training Details

Training Procedure

Used RunPod with the following setup:

1 x A100 PCIe
31 vCPU 117 GB RAM
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
On-Demand - Secure Cloud
60 GB Disk
60 GB Pod Volume

Training Hyperparameters

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
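
Two derived quantities follow from these settings and may help when comparing runs. The sketch below assumes a single GPU (per the "1 x A100 PCIe" setup listed above) and PEFT's default LoRA scaling of `lora_alpha / r`, which applies when rank-stabilized LoRA is not enabled; the variable names are illustrative.

```python
# Values copied from the configuration above
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
lora_r = 64
lora_alpha = 64

# Assumption: one GPU, matching the "1 x A100 PCIe" training setup
num_gpus = 1

# Number of sequences contributing to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Assumption: default LoRA scaling (lora_alpha / r), i.e. rsLoRA not enabled
lora_scaling = lora_alpha / lora_r

print("Effective batch size:", effective_batch_size)
print("LoRA scaling factor:", lora_scaling)
```

With these values, each optimizer step aggregates 32 sequences and the LoRA update is applied with a scaling factor of 1.0.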

Framework versions

PEFT 0.12.0

🔧 Technical Details

We need to be cautious about a few risks:

In our evaluation setup, the training and test sets come from the same data distribution (sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
The datasets used were gathered from publicly available sources. Over time, foundational models may access both the training and test sets, potentially achieving similar or even better results.

Also check the related blogpost: Link

📄 License

License: gemma

Property	Details
Model Type	Not specified
Training Data	neo4j/text2cypher-2024v1
