# NSQL (NSQL-6B)

NSQL is a family of autoregressive open-source large foundation models (FMs) tailored for SQL generation tasks. It offers a practical solution for generating SQL queries from natural-language prompts and table schemas.
## Quick Start
The NSQL model can be easily integrated into your projects. Here are some basic steps to get started:
- Install the necessary libraries (`transformers` in this case).
- Load the tokenizer and the model.
- Prepare your input text with table schemas and natural-language questions.
- Generate SQL queries using the model.
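The prompt-assembly step above can be factored into a small helper. The sketch below is illustrative only: `build_prompt` is a hypothetical name, not part of the model's API, and it simply reproduces the prompt format used in the usage examples further down.

```python
def build_prompt(schemas, question):
    """Assemble an NSQL-style prompt: CREATE TABLE statements, then the
    question as an SQL comment, then "SELECT" as the generation prefix.
    (Hypothetical helper; the format mirrors the card's usage examples.)"""
    header = "\n\n".join(schemas)
    return (
        f"{header}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n"
        f"-- {question}\n"
        "SELECT"
    )

schema = """CREATE TABLE stadium (
    stadium_id number,
    capacity number
)"""
prompt = build_prompt([schema], "how many stadiums in total?")
```

The resulting string is what gets passed to the tokenizer in the examples below.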
## Features

- Specialized for SQL: specifically designed for SQL generation tasks, producing high-quality query outputs.
- Pre-trained and fine-tuned: based on CodeGen-Multi 6B, pre-trained on general SQL queries and fine-tuned on text-to-SQL pairs.
- Benchmark-tested: evaluated on well-known text-to-SQL benchmarks such as Spider and GeoQuery.
## Installation

The original README does not provide specific installation instructions, but the `transformers` library is required to use the model. Install it with:

```bash
pip install transformers
```
## Usage Examples

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-6B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-6B")

# Prompt: table schemas, the question as an SQL comment, "SELECT" as prefix
text = """CREATE TABLE stadium (
    stadium_id number,
    location text,
    name text,
    capacity number,
    highest number,
    lowest number,
    average number
)

CREATE TABLE singer (
    singer_id number,
    name text,
    country text,
    song_name text,
    song_release_year text,
    age number,
    is_male others
)

CREATE TABLE concert (
    concert_id number,
    concert_name text,
    theme text,
    stadium_id text,
    year text
)

CREATE TABLE singer_in_concert (
    concert_id number,
    singer_id text
)

-- Using valid SQLite, answer the following questions for the tables provided above.
-- What is the maximum, the average, and the minimum capacity of stadiums?
SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
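Note that `model.generate` returns the prompt tokens followed by the completion, so decoding the full sequence reproduces the input text as well. One way to recover just the generated SQL is to strip the echoed prompt; the sketch below (with a hypothetical `extract_sql` helper and a mock decoded string, so no model call is needed) assumes the decoded text begins with the prompt verbatim:

```python
def extract_sql(decoded: str, prompt: str) -> str:
    """Strip the echoed prompt from a decoded generation, re-attaching the
    trailing "SELECT" that the prompt ends with. Assumes `decoded` begins
    with `prompt` verbatim; falls back to the full text otherwise."""
    if decoded.startswith(prompt):
        return "SELECT" + decoded[len(prompt):]
    return decoded

# Illustration with a mock decoded output (no model call needed):
prompt = "CREATE TABLE t (x number)\n-- how many rows?\nSELECT"
decoded = prompt + " COUNT(*) FROM t"
sql = extract_sql(decoded, prompt)  # "SELECT COUNT(*) FROM t"
```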
### More Examples

#### Example 2
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-6B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-6B")

text = """CREATE TABLE stadium (
    stadium_id number,
    location text,
    name text,
    capacity number
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- how many stadiums in total?
SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
#### Example 3
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-6B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-6B")

text = """CREATE TABLE work_orders (
    ID NUMBER,
    CREATED_AT TEXT,
    COST FLOAT,
    INVOICE_AMOUNT FLOAT,
    IS_DUE BOOLEAN,
    IS_OPEN BOOLEAN,
    IS_OVERDUE BOOLEAN,
    COUNTRY_NAME TEXT
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- how many work orders are open?
SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
For more information (e.g., running the model against your local database), see the examples in this repository.
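For a local SQLite database, the schema text can be pulled straight from `sqlite_master` and dropped into the same prompt format. The sketch below uses only the standard library, with an in-memory database standing in for your own file; `schema_prompt` is a hypothetical helper, and column types are used as stored rather than mapped to the `number`/`text` vocabulary in the card's examples:

```python
import sqlite3

def schema_prompt(conn, question):
    """Build an NSQL-style prompt from the CREATE TABLE statements stored
    in a SQLite database's sqlite_master table. (Illustrative sketch, not
    an official utility.)"""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL"
    ).fetchall()
    tables = "\n\n".join(r[0] for r in rows)
    return (
        f"{tables}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n"
        f"-- {question}\n"
        "SELECT"
    )

# In-memory database as a stand-in for a local database file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_orders (id INTEGER, is_open INTEGER)")
prompt = schema_prompt(conn, "how many work orders are open?")
```

The resulting prompt can then be tokenized and passed to `model.generate` exactly as in the examples above.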
## Documentation

### Model Description

NSQL is a family of autoregressive open-source large foundation models. The checkpoint in this repository is based on CodeGen-Multi 6B from Salesforce. It was first pre-trained on a dataset of general SQL queries and then fine-tuned on a dataset of text-to-SQL pairs.
### Training Data

- General SQL queries: from The Stack, comprising 1M training samples.
- Labeled text-to-SQL pairs: drawn from more than 20 public sources across the web, including standard datasets. The Spider and GeoQuery datasets are held out for evaluation.
### Evaluation Data

The models are evaluated on two text-to-SQL benchmarks: Spider and GeoQuery.
### Training Procedure

NSQL is trained with a cross-entropy loss to maximize the likelihood of sequential inputs. For fine-tuning on text-to-SQL pairs, the loss is computed only over the SQL portion of each pair. The models are trained on 80 GB A100 GPUs, leveraging data and model parallelism. The model is pre-trained for 3 epochs and fine-tuned for 10 epochs.
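Computing the loss only over the SQL portion is typically done by masking the prompt tokens in the labels, e.g. with the conventional -100 ignore index used by Hugging Face-style cross-entropy losses. A framework-free sketch of that masking, with hypothetical token IDs (this is an illustration of the technique, not the authors' training code):

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy loss

def mask_prompt_labels(token_ids, prompt_len):
    """Copy token IDs into labels, masking the prompt portion so the loss
    is computed only over the SQL completion. (Illustrative sketch.)"""
    return [
        IGNORE_INDEX if i < prompt_len else tok
        for i, tok in enumerate(token_ids)
    ]

# A sequence of 5 prompt tokens followed by 3 SQL tokens:
ids = [11, 12, 13, 14, 15, 901, 902, 903]
labels = mask_prompt_labels(ids, prompt_len=5)
# labels -> [-100, -100, -100, -100, -100, 901, 902, 903]
```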
### Intended Use and Limitations

The model is designed for text-to-SQL generation from given table schemas and natural-language prompts. It works best with the prompt format defined above and when generating `SELECT` queries.
## Technical Details

- Model architecture: based on CodeGen-Multi 6B.
- Training: pre-trained on general SQL queries and fine-tuned on text-to-SQL pairs.
- Loss function: cross-entropy loss.
- Hardware: trained on 80 GB A100 GPUs with data and model parallelism.
## License

The model is licensed under the bsd-3-clause license.
| Property | Details |
|----------|---------|
| Model Type | Autoregressive open-source large foundation model for SQL generation |
| Training Data | General SQL queries from The Stack and labeled text-to-SQL pairs from multiple public sources |