Model Card for Model ID
This model demonstrates how fine-tuning foundational models with the Neo4j-Text2Cypher (2024) Dataset can boost performance on the Text2Cypher task. It's part of ongoing research, highlighting the dataset's potential rather than being a production-ready solution.
Features
- Demonstrates fine-tuning of foundational models for the Text2Cypher task.
- Utilizes the Neo4j-Text2Cypher (2024) Dataset.
- Benchmarking results and an overview of the finetuned models are available via external links.
Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "DavidLanz/text2cypher-gemma-2-9b-it-finetuned-2024v1"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)"
instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question} \n Cypher output: "
)
prompt = instruction.format(schema=schema, question=question)
# Move the inputs to whichever device device_map="auto" placed the model on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Cypher Query:", generated_text)
Advanced Usage
def prepare_chat_prompt(question, schema):
    # Wrap the formatted instruction in a single-turn chat message.
    chat = [
        {
            "role": "user",
            "content": instruction.format(
                schema=schema, question=question
            ),
        }
    ]
    return chat
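def _postprocess_output_cypher(output_cypher: str) -> str:
    # Keep only the text before an optional explanation section.
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    output_cypher = output_cypher.strip("`\n")
    # Remove the "cypher" fence label as a prefix; str.lstrip("cypher\n")
    # would strip any of those characters and could eat into the query.
    if output_cypher.startswith("cypher"):
        output_cypher = output_cypher[len("cypher"):]
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher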
new_message = prepare_chat_prompt(question=question, schema=schema)
try:
    prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens so the prompt is not echoed back.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    chat_generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    final_cypher = _postprocess_output_cypher(chat_generated_text)
    print("Processed Cypher Query:", final_cypher)
except AttributeError:
    print("Error: `apply_chat_template` not supported by this tokenizer. Check compatibility.")
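The post-processing step can be tried without loading the model. The sketch below applies the same cleanup steps to a hand-written string. `clean_cypher` is a hypothetical stand-in for `_postprocess_output_cypher` that uses an explicit prefix check for the `cypher` fence label, and the sample response is invented for illustration. It assumes the model wraps the query in a Markdown code fence labeled `cypher` and may append an `**Explanation:**` section.

```python
def clean_cypher(raw: str) -> str:
    """Hypothetical stand-in for _postprocess_output_cypher."""
    # Keep only the text before an optional explanation section.
    query, _, _ = raw.partition("**Explanation:**")
    # Drop code-fence backticks and surrounding newlines.
    query = query.strip("`\n")
    # Drop the "cypher" language label that follows the opening fence, if present.
    if query.startswith("cypher"):
        query = query[len("cypher"):]
    return query.strip("`\n ")


# Invented sample response, not actual model output.
fence = "`" * 3
sample = (
    f"{fence}cypher\n"
    "MATCH (a:Actor {name: 'Tom Hanks'})-[:ActedIn]->(m:Movie) RETURN m\n"
    f"{fence}\n"
    "**Explanation:** The query follows ActedIn relationships."
)
cleaned = clean_cypher(sample)
print(cleaned)
```

A response that is already a bare query passes through unchanged, since none of the stripping steps match.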
Documentation
Model Details
Model Description
This model shows how fine-tuning foundational models using the Neo4j-Text2Cypher (2024) Dataset (link) can improve performance on the Text2Cypher task. Note that this is part of ongoing research, aiming to highlight the dataset's potential rather than being a production-ready solution.
Base model: google/gemma-2-9b-it
Dataset: neo4j/text2cypher-2024v1
An overview of the finetuned models and benchmarking results are shared at Link1 and Link2
Have ideas or insights? Contact us: [Neo4j/Team-GenAI](mailto:team-gen-ai@neo4j.com)
Training Details
Training Procedure
Used RunPod with the following setup:
- 1 x A100 PCIe
- 31 vCPU 117 GB RAM
- runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
- On-Demand - Secure Cloud
- 60 GB Disk
- 60 GB Pod Volume
Training Hyperparameters
- lora_config = LoraConfig(
      r=64,
      lora_alpha=64,
      target_modules=target_modules,
      lora_dropout=0.05,
      bias="none",
      task_type="CAUSAL_LM",
  )
- sft_config = SFTConfig(
      dataset_text_field=dataset_text_field,
      per_device_train_batch_size=4,
      gradient_accumulation_steps=8,
      dataset_num_proc=16,
      max_seq_length=1600,
      logging_dir="./logs",
      num_train_epochs=1,
      learning_rate=2e-5,
      save_steps=5,
      save_total_limit=1,
      logging_steps=5,
      output_dir="outputs",
      optim="paged_adamw_8bit",
      save_strategy="steps",
  )
- bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_use_double_quant=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
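Two derived quantities follow from these settings and may help when adapting them to other hardware: the effective batch size per optimizer step and the LoRA scaling factor. The sketch below is plain arithmetic on the values listed above. It assumes a single GPU (matching the 1 x A100 setup) and the standard LoRA scaling of `lora_alpha / r`; the variable names `num_gpus`, `effective_batch_size`, and `lora_scaling` are illustrative.

```python
# Values copied from the SFTConfig and LoraConfig listed above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1  # assumption: single A100, as in the training setup

# Number of sequences contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 32

r = 64
lora_alpha = 64

# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scaling = lora_alpha / r
print(lora_scaling)  # 1.0
```

If the per-device batch size has to shrink to fit a smaller GPU, raising `gradient_accumulation_steps` by the same factor keeps the effective batch size at 32.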
Framework versions
Technical Details
We need to be cautious about a few risks:
- In our evaluation setup, the training and test sets come from the same data distribution (sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
- The datasets used were gathered from publicly available sources. Over time, foundational models may access both the training and test sets, potentially achieving similar or even better results.
Also check the related blogpost: Link
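Because the training and test sets are sampled from the same larger dataset, one simple sanity check is to measure how many test questions also occur in the training split. The sketch below is not part of the evaluation described here; it is a minimal illustration on toy data, and `normalize` and `overlap_fraction` are hypothetical helper names. It only catches matches that differ in case or whitespace, not paraphrases.

```python
def normalize(question: str) -> str:
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(question.lower().split())


def overlap_fraction(train_questions, test_questions) -> float:
    """Fraction of test questions whose normalized text also appears in train."""
    if not test_questions:
        return 0.0
    train_set = {normalize(q) for q in train_questions}
    hits = sum(normalize(q) in train_set for q in test_questions)
    return hits / len(test_questions)


# Toy data, invented for illustration.
train = ["What are the movies of Tom Hanks?", "Who directed Inception?"]
test = ["what are the  movies of tom hanks?", "Which actors played in Heat?"]
overlap = overlap_fraction(train, test)
print(overlap)  # 0.5
```

A high overlap would suggest that test scores partly reflect memorized questions rather than generalization.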
License
License: gemma
| Property | Details |
|---|---|
| Model Type | Not specified |
| Training Data | neo4j/text2cypher-2024v1 |