pipSQL-1.3b
A 1.3-billion-parameter text-to-SQL model that outperforms most SQL expert models and ChatGPT on popular benchmarks.
Quick Start
What have we built?
pipSQL-1.3b is a 1.3-billion-parameter text-to-SQL model that outperforms most SQL expert models and ChatGPT on popular benchmarks. It is a distilled model built on the DeepSeek base model. For our state-of-the-art model, please refer to PipableAI/pip-library-etl-1.3b.
How we built it?
We used softmax cross-entropy and a modified form of policy gradient, together with a Q loss, optimized in an expectation-maximization (EM) setup. The loss behavior under this setup is shown below.

[Figure: training loss curves under the EM setup]
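As a rough illustration, here is a hypothetical PyTorch sketch of one way such a combined objective could be assembled. The reward signal, the Q-value baseline, the loss weights (`alpha`, `beta`), and the EM schedule are all assumptions made for illustration; this is not the actual training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the combined objective described above; the
# weighting, reward definition, and EM schedule are illustrative
# assumptions, not the released training code.
def combined_loss(logits, target_ids, rewards, q_values, alpha=1.0, beta=0.5):
    # logits: (batch, seq, vocab); target_ids: (batch, seq)
    # rewards, q_values: (batch,) -- one scalar per generated SQL query

    # Supervised term: softmax cross-entropy against gold SQL tokens.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

    # Policy-gradient term: sequence log-probability weighted by an
    # advantage (reward minus the Q-value baseline).
    log_probs = F.log_softmax(logits, dim=-1)
    seq_logprob = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1).sum(-1)
    advantage = (rewards - q_values).detach()
    pg = -(advantage * seq_logprob).mean()

    # Q loss: regress the value estimate toward the observed reward
    # (an E-step would refresh these targets; the M-step minimizes the loss).
    q_loss = F.mse_loss(q_values, rewards)

    return ce + alpha * pg + beta * q_loss
```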
Features
Benchmarking
For benchmarking purposes, we are using Semantic Evaluation for Text-to-SQL with Distilled Test Suites, an officially accepted evaluation framework for Spider, SParC, and CoSQL proposed by a research team from Yale and Berkeley. The benchmark contains 2200 test data points. You can run the evaluation using the following link:
Test Suite SQL Eval
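The harness scores a file of predicted queries against the gold queries. Below is a minimal sketch of producing such a predictions file with this model; the `examples.json` name and its fields are placeholders, and the exact input format the harness expects is documented in its repository.

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b").to(device)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

def predict_sql(schema: str, question: str) -> str:
    prompt = f"<schema>{schema}</schema>\n<question>{question}</question>\n<sql>"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded.split("<sql>")[1].split("</sql>")[0].strip()

# examples.json is a placeholder holding [{"schema": ..., "question": ...}, ...];
# the harness expects one predicted query per line.
with open("examples.json") as f:
    examples = json.load(f)
with open("pred.txt", "w") as f:
    for ex in examples:
        f.write(predict_sql(ex["schema"], ex["question"]).replace("\n", " ") + "\n")
```

Accuracy by query difficulty: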
| Model | Easy | Medium | Hard | Extra |
|---------------|------|--------|------|-------|
| sqlcoder-7b-2 | 72.0 | 58.0 | 40.6 | 37.3 |
| pipSQL-1.3b | 78.5 | 57.5 | 42.1 | 28.3 |
| pipSQL-7b | 63.0 | 40.0 | 30.2 | 25.0 |
| sqlcoder-7b | 60.6 | 48.2 | 28.3 | 20.4 |
| gpt-3.5 | 58.8 | 44.7 | 31.0 | 28.4 |
We have also benchmarked it on Defog Eval, which contains 200 test data points handpicked by the Defog team. Here is the link:
Defog SQL-Eval
The results are shown below.

[Figure: Defog SQL-Eval results]
Installation
```bash
pip install transformers
```
Usage Examples
Prompt
```python
prompt = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
```
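For example, filling the template with one of the tables from the documentation section below (the question here is just an illustration):

```python
schema = """CREATE TABLE Orders (
    order_id number,
    customer_id number,
    order_status_code text,
    date_order_placed time);"""
question = "How many orders has each customer placed?"

prompt = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
```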
Basic Usage - PyTorch
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b").to(device)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

# Tokenize the prompt built above and generate the SQL completion.
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)

# The model closes its answer with </sql>; extract the query between the tags.
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("<sql>")[1].split("</sql>")[0])
```
Advanced Usage - Flax
```python
from transformers import AutoTokenizer, FlaxAutoModelForCausalLM

# Load the PyTorch weights into a Flax model; JAX places arrays on the
# default accelerator automatically, so no explicit device handling is needed.
model = FlaxAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b", from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")

inputs = tokenizer(prompt, return_tensors="jax")
outputs = model.generate(**inputs, max_new_tokens=200)

# Flax generate returns an output object whose .sequences field holds the token ids.
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True).split("<sql>")[1].split("</sql>")[0])
```
Documentation
Schema
```sql
CREATE TABLE Products (
    product_id number,
    parent_product_id number,
    product_name text,
    product_price number,
    product_color text,
    product_size text,
    product_description text);

CREATE TABLE Customers (
    customer_id number,
    gender_code text,
    customer_first_name text,
    customer_middle_initial text,
    customer_last_name text,
    email_address text,
    login_name text,
    login_password text,
    phone_number text,
    address_line_1 text,
    town_city text,
    county text,
    country text);

CREATE TABLE Customer_Payment_Methods (
    customer_id number,
    payment_method_code text);

CREATE TABLE Invoices (
    invoice_number number,
    invoice_status_code text,
    invoice_date time);

CREATE TABLE Orders (
    order_id number,
    customer_id number,
    order_status_code text,
    date_order_placed time);

CREATE TABLE Order_Items (
    order_item_id number,
    product_id number,
    order_id number,
    order_item_status_code text);

CREATE TABLE Shipments (
    shipment_id number,
    order_id number,
    invoice_number number,
    shipment_tracking_number text,
    shipment_date time);

CREATE TABLE Shipment_Items (
    shipment_id number,
    order_item_id number);
```
Questions
What are the email address, town and county of the customers who are of the least common gender?

```sql
SELECT email_address , town_city , county FROM customers GROUP BY gender_code ORDER BY count(*) ASC LIMIT 1
```

What are the product price and the product size of the products whose price is above average?

```sql
SELECT product_price , product_size FROM products WHERE product_price > (SELECT avg(product_price) FROM products)
```

Which customers did not make any orders? List the first name, middle initial and last name.

```sql
SELECT T1.customer_first_name , T1.customer_middle_initial , T1.customer_last_name FROM Customers AS T1 WHERE T1.customer_id NOT IN (SELECT T2.customer_id FROM Orders AS T2)
```
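To sanity-check a generated query, you can run it against a small SQLite database built from the schema. A minimal sketch with made-up sample rows follows; the table slice and its contents are purely illustrative.

```python
import sqlite3

# In-memory database with a slice of the schema above and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    product_id number,
    parent_product_id number,
    product_name text,
    product_price number,
    product_size text);
INSERT INTO Products VALUES
    (1, NULL, 'Desk', 120.0, 'L'),
    (2, NULL, 'Lamp', 35.0, 'S'),
    (3, NULL, 'Chair', 60.0, 'M');
""")

# Run the "price above average" example query from above.
query = ("SELECT product_price , product_size FROM products "
         "WHERE product_price > (SELECT avg(product_price) FROM products)")
print(conn.execute(query).fetchall())  # -> [(120.0, 'L')]
```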
Technical Details
Training combined softmax cross-entropy with a modified policy-gradient objective and a Q loss, optimized in an EM setup, as described under "How we built it?" above.
License
The model is open source under the Apache 2.0 License.
Team
Avi Kothari, Pratham Gupta, Ritvik Aryan Kalra, Rohan Bhatial, Soham Acharya
Additional Information
| Property | Details |
|----------|---------|
| Model Type | A 1.3-billion-parameter text-to-SQL model distilled from the DeepSeek base model |
| Training Data | Not provided |
| Metrics | Accuracy |
| Tags | sql, code, text2sql, instruction_tuned, basemodel, jax, pytorch, text-generation-inference |
| Library Name | transformers |
| Pipeline Tag | text-generation |
| Datasets | PipableAI/pip-txt-to-sql-spider-bird-dataset |