# NSQL (NSQL-6B)

NSQL is a family of autoregressive open-source large foundation models (FMs) tailored for SQL generation tasks. It offers a practical solution for generating SQL queries from natural-language prompts and table schemas.
## Quick Start
The NSQL model can be easily integrated into your projects. Here are some basic steps to get started:
- Install the necessary libraries (`transformers` in this case).
- Load the tokenizer and the model.
- Prepare your input text with table schemas and natural-language questions.
- Generate SQL queries using the model.
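The prompt-assembly step above can be factored into a small helper. The sketch below is illustrative only: `build_prompt` is a hypothetical name, not part of the model's API, and it simply reproduces the prompt format used in the usage examples further down.

```python
def build_prompt(schemas, question):
    """Assemble an NSQL-style prompt: CREATE TABLE statements, then the
    question as an SQL comment, then "SELECT" as the generation prefix.
    (Hypothetical helper; the format mirrors the card's usage examples.)"""
    header = "\n\n".join(schemas)
    return (
        f"{header}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n"
        f"-- {question}\n"
        "SELECT"
    )

schema = """CREATE TABLE stadium (
    stadium_id number,
    capacity number
)"""
prompt = build_prompt([schema], "how many stadiums in total?")
```

The resulting string is what gets passed to the tokenizer in the examples below.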
## Features

- Specialized for SQL: specifically designed for SQL generation tasks, producing high-quality query outputs.
- Pre-trained and fine-tuned: based on CodeGen-Multi 6B, pre-trained on general SQL queries and fine-tuned on text-to-SQL pairs.
- Benchmark-tested: evaluated on well-known text-to-SQL benchmarks such as Spider and GeoQuery.
## Installation

The original README does not provide specific installation instructions, but the `transformers` library is required to use the model. Install it with:

```bash
pip install transformers
```
## Usage Examples

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-6B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-6B")

# Prompt: table schemas, the question as an SQL comment, "SELECT" as prefix
text = """CREATE TABLE stadium (
    stadium_id number,
    location text,
    name text,
    capacity number,
    highest number,
    lowest number,
    average number
)

CREATE TABLE singer (
    singer_id number,
    name text,
    country text,
    song_name text,
    song_release_year text,
    age number,
    is_male others
)

CREATE TABLE concert (
    concert_id number,
    concert_name text,
    theme text,
    stadium_id text,
    year text
)

CREATE TABLE singer_in_concert (
    concert_id number,
    singer_id text
)

-- Using valid SQLite, answer the following questions for the tables provided above.
-- What is the maximum, the average, and the minimum capacity of stadiums?
SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
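Note that `model.generate` returns the prompt tokens followed by the completion, so decoding the full sequence reproduces the input text as well. One way to recover just the generated SQL is to strip the echoed prompt; the sketch below (with a hypothetical `extract_sql` helper and a mock decoded string, so no model call is needed) assumes the decoded text begins with the prompt verbatim:

```python
def extract_sql(decoded: str, prompt: str) -> str:
    """Strip the echoed prompt from a decoded generation, re-attaching the
    trailing "SELECT" that the prompt ends with. Assumes `decoded` begins
    with `prompt` verbatim; falls back to the full text otherwise."""
    if decoded.startswith(prompt):
        return "SELECT" + decoded[len(prompt):]
    return decoded

# Illustration with a mock decoded output (no model call needed):
prompt = "CREATE TABLE t (x number)\n-- how many rows?\nSELECT"
decoded = prompt + " COUNT(*) FROM t"
sql = extract_sql(decoded, prompt)  # "SELECT COUNT(*) FROM t"
```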
### More Examples

#### Example 2
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-6B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-6B")

text = """CREATE TABLE stadium (
    stadium_id number,
    location text,
    name text,
    capacity number
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- how many stadiums in total?
SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
#### Example 3
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-6B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-6B")

text = """CREATE TABLE work_orders (
    ID NUMBER,
    CREATED_AT TEXT,
    COST FLOAT,
    INVOICE_AMOUNT FLOAT,
    IS_DUE BOOLEAN,
    IS_OPEN BOOLEAN,
    IS_OVERDUE BOOLEAN,
    COUNTRY_NAME TEXT
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- how many work orders are open?
SELECT"""

input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
For more information (e.g., running the model against your local database), see the examples in this repository.
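For a local SQLite database, the schema text can be pulled straight from `sqlite_master` and dropped into the same prompt format. The sketch below uses only the standard library, with an in-memory database standing in for your own file; `schema_prompt` is a hypothetical helper, and column types are used as stored rather than mapped to the `number`/`text` vocabulary in the card's examples:

```python
import sqlite3

def schema_prompt(conn, question):
    """Build an NSQL-style prompt from the CREATE TABLE statements stored
    in a SQLite database's sqlite_master table. (Illustrative sketch, not
    an official utility.)"""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND sql IS NOT NULL"
    ).fetchall()
    tables = "\n\n".join(r[0] for r in rows)
    return (
        f"{tables}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n"
        f"-- {question}\n"
        "SELECT"
    )

# In-memory database as a stand-in for a local database file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_orders (id INTEGER, is_open INTEGER)")
prompt = schema_prompt(conn, "how many work orders are open?")
```

The resulting prompt can then be tokenized and passed to `model.generate` exactly as in the examples above.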
## Documentation

### Model Description

NSQL is a family of autoregressive open-source large foundation models. The checkpoint in this repository is based on CodeGen-Multi 6B from Salesforce. It was first pre-trained on a dataset of general SQL queries and then fine-tuned on a dataset of text-to-SQL pairs.
### Training Data

- General SQL queries: from The Stack, comprising 1M training samples.
- Labeled text-to-SQL pairs: drawn from more than 20 public sources across the web, including standard datasets. The Spider and GeoQuery datasets are held out for evaluation.
### Evaluation Data

The models are evaluated on two text-to-SQL benchmarks: Spider and GeoQuery.
### Training Procedure

NSQL is trained with a cross-entropy loss to maximize the likelihood of sequential inputs. For fine-tuning on text-to-SQL pairs, the loss is computed only over the SQL portion of each pair. The models are trained on 80 GB A100 GPUs, leveraging data and model parallelism. The model is pre-trained for 3 epochs and fine-tuned for 10 epochs.
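Computing the loss only over the SQL portion is typically done by masking the prompt tokens in the labels, e.g. with the conventional -100 ignore index used by Hugging Face-style cross-entropy losses. A framework-free sketch of that masking, with hypothetical token IDs (this is an illustration of the technique, not the authors' training code):

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy loss

def mask_prompt_labels(token_ids, prompt_len):
    """Copy token IDs into labels, masking the prompt portion so the loss
    is computed only over the SQL completion. (Illustrative sketch.)"""
    return [
        IGNORE_INDEX if i < prompt_len else tok
        for i, tok in enumerate(token_ids)
    ]

# A sequence of 5 prompt tokens followed by 3 SQL tokens:
ids = [11, 12, 13, 14, 15, 901, 902, 903]
labels = mask_prompt_labels(ids, prompt_len=5)
# labels -> [-100, -100, -100, -100, -100, 901, 902, 903]
```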
### Intended Use and Limitations

The model is designed for text-to-SQL generation from given table schemas and natural-language prompts. It works best with the prompt format defined above and when generating `SELECT` queries.
## Technical Details

- Model architecture: based on CodeGen-Multi 6B.
- Training: pre-trained on general SQL queries and fine-tuned on text-to-SQL pairs.
- Loss function: cross-entropy loss.
- Hardware: trained on 80 GB A100 GPUs with data and model parallelism.
## License

The model is licensed under the bsd-3-clause license.
| Property | Details |
|----------|---------|
| Model Type | Autoregressive open-source large foundation model for SQL generation |
| Training Data | General SQL queries from The Stack and labeled text-to-SQL pairs from multiple public sources |