đ DuckDB-NSQL-7B
NSQL is a family of autoregressive open - source large foundation models (FMs) specifically designed for SQL generation tasks. DuckDB - NSQL, a new member of the NSQL family introduced in this repository, is based on Meta's original [Llama - 2 7B model](https://huggingface.co/meta - llama/Llama - 2 - 7b). It is further pre - trained on a dataset of general SQL queries and then fine - tuned on a dataset composed of DuckDB text - to - SQL pairs.
đ Quick Start
Model Information
Property |
Details |
License |
llama2 |
Base Model |
meta - llama/Llama - 2 - 7b |
Inference Parameters |
do_sample: false, max_length: 200 |
Usage Examples
You can use the following code examples to generate valid DuckDB SQL based on different inputs.
Basic Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
text = """### Instruction:
Your task is to generate valid duckdb SQL to answer the following question.
### Input:
### Question:
create a new table called tmp from test.csv
### Response (use duckdb shorthand if possible):
"""
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
Advanced Usage
Example 2
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
text = """### Instruction:
Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
### Input:
Here is the database schema that the SQL query will run on:
CREATE TABLE taxi (
VendorID bigint,
tpep_pickup_datetime timestamp,
tpep_dropoff_datetime timestamp,
passenger_count double,
trip_distance double,
fare_amount double,
extra double,
tip_amount double,
tolls_amount double,
improvement_surcharge double,
total_amount double,
);
### Question:
get all columns ending with _amount from taxi table
### Response (use duckdb shorthand if possible):"""
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
Example 3
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("motherduckdb/DuckDB-NSQL-7B-v0.1", torch_dtype=torch.bfloat16)
text = """### Instruction:
Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.
### Input:
Here is the database schema that the SQL query will run on:
CREATE TABLE rideshare (
hvfhs_license_num varchar,
dispatching_base_num varchar,
originating_base_num varchar,
request_datetime timestamp,
on_scene_datetime timestamp,
pickup_datetime timestamp,
dropoff_datetime timestamp,
trip_miles double,
trip_time bigint,
);
### Question:
get longest trip in december 2022
### Response (use duckdb shorthand if possible):
"""
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
For more information (e.g., run with your local database), please find examples in this repository.
⨠Features
- Specifically designed for SQL generation tasks, especially for generating valid DuckDB SQL.
- Not constrained to
SELECT
statements, can generate any valid DuckDB SQL statement, including those for official DuckDB extensions.
đ Documentation
Training Data
The model is trained on 200k DuckDB text - to - SQL pairs. These pairs are synthetically generated using [Mixtral - 8x7B - Instruct - v0.1](https://huggingface.co/mistralai/Mixtral - 8x7B - Instruct - v0.1) and guided by the DuckDB v0.9.2 documentation. Additionally, text - to - SQL pairs from NSText2SQL are transpiled to DuckDB SQL using sqlglot.
Evaluation Data
The models are evaluated on a DuckDB - specific benchmark containing 75 text - to - SQL pairs. The benchmark is available [here](https://github.com/NumbersStationAI/DuckDB - NSQL/).
Training Procedure
DuckDB - NSQL is trained using cross - entropy loss to maximize the likelihood of sequential inputs. For fine - tuning on text - to - SQL pairs, the loss is only computed over the SQL portion of the pair. The model is trained using 80GB A100s, leveraging data and model parallelism, and fine - tuned for 10 epochs.
Intended Use and Limitations
The model is designed for text - to - SQL generation tasks from given table schema and natural language prompts. It works best with the defined prompt format and outputs.
đ License
The model uses the llama2 license.