🚀 Pipable’s pipSQL
Pipable’s pipSQL is a model for generating SQL queries from prompts and schemas. Distilled from Llama 1B, it outperforms ChatGPT and Claude on many SQL benchmarks.
🚀 Quick Start
Please refer to https://huggingface.co/PipableAI/pipSQL-1.3b for our state-of-the-art model, which outperforms ChatGPT and Claude on SQL tasks across many benchmarks.
✨ Features
Pipable’s pipSQL is a model distilled from Llama 1B that generates SQL queries given a prompt and schema. It is trained with a unique pipeline in which the model alternates between two objectives (a rough sketch of this training loop follows the list):
- Maximizing the log probability of all tokens in the sequence, including the prompt tokens.
- Minimizing the difference between the true value and the predicted maximum value of the output tokens, i.e., the generated tokens for the SQL-query slice of the full sequence.
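The snippet below is a minimal, hedged sketch of how such an alternating loop could look in PyTorch. It is an interpretation of the prose above, not PipableAI's released training code: the loss definitions, the example schema/question/SQL pair, the learning rate, the use of cross-entropy over the SQL slice as the "difference" term, and the choice to start from the published checkpoint are all assumptions made for illustration.

```python
# Hedged sketch of the alternating two-objective loop described above.
# Everything below (losses, data, hyperparameters) is illustrative, not
# PipableAI's actual training pipeline.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("PipableAI/pipSQL1b")
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pipSQL1b")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def full_sequence_loss(logits, input_ids):
    # Objective 1: negative log-likelihood of every token in the sequence,
    # prompt tokens included (no label masking).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def sql_slice_loss(logits, input_ids, sql_start):
    # Objective 2 (one reading of the prose): penalize the gap between the
    # true SQL tokens and the model's top-scoring predictions, restricted
    # to the SQL slice of the sequence. Cross-entropy over that slice is
    # used here as a differentiable stand-in for that difference.
    sql_logits = logits[:, sql_start - 1:-1]
    sql_labels = input_ids[:, sql_start:]
    return F.cross_entropy(
        sql_logits.reshape(-1, sql_logits.size(-1)),
        sql_labels.reshape(-1),
    )

# One made-up training pair; real training would iterate over a dataset
# such as PipableAI/spider-bird.
prompt = ("<schema>CREATE TABLE employees(id INT, name VARCHAR, salary INT)</schema>"
          "<question>List names of employees earning above 50000</question><sql>")
sql = "SELECT name FROM employees WHERE salary > 50000</sql>"

full_ids = tokenizer(prompt + sql, return_tensors="pt").input_ids
# Approximate index where the SQL slice begins (assumes the prompt's
# tokenization is a prefix of the full sequence).
sql_start = tokenizer(prompt, return_tensors="pt").input_ids.size(1)

for step in range(4):  # alternate between the two objectives
    logits = model(full_ids).logits
    if step % 2 == 0:
        loss = full_sequence_loss(logits, full_ids)
    else:
        loss = sql_slice_loss(logits, full_ids, sql_start)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```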
📦 Installation
pipSQL is loaded through the Hugging Face transformers library; installing transformers together with PyTorch (or Flax and JAX for the Flax example) is enough to run the snippets below.
💻 Usage Examples
Basic Usage
text = """<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
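The placeholders are filled with a CREATE TABLE style schema and a natural-language question. The values below are made up purely for illustration; the resulting `text` string is what the examples that follow pass to the tokenizer.

```python
# Illustrative values for the template above (not from the model card).
schema = "CREATE TABLE employees(id INT, name VARCHAR, salary INT)"
question = "List the names of employees with a salary above 50000"

text = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
```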
Advanced Usage - PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available

model = AutoModelForCausalLM.from_pretrained("PipableAI/pipSQL1b").to(device)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pipSQL1b")

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=200)

# Keep only the generated SQL between the <sql> and </sql> tags.
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])
Advanced Usage - Flax
from transformers import FlaxAutoModelForCausalLM, AutoTokenizer

model = FlaxAutoModelForCausalLM.from_pretrained("PipableAI/pipSQL1b", from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pipSQL1b")
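A minimal way to run generation with the Flax model, mirroring the PyTorch example and assuming `text` is built from the Basic Usage template. The call pattern below is a sketch based on the standard transformers Flax generation API, not a recipe documented in the original card.

```python
# Sketch: generate with the Flax model loaded above.
inputs = tokenizer(text, return_tensors="jax")
outputs = model.generate(**inputs, max_new_tokens=200)
decoded = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
# Keep only the generated SQL between the <sql> and </sql> tags.
print(decoded.split('<sql>')[1].split('</sql>')[0])
```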
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | pipSQL, distilled from Llama 1B |
| Training Data | PipableAI/spider-bird |
| Tags | code, sql, text2sql, instruction_tuned, jax, pytorch, 1b, expert |
| Metrics | accuracy |
| Pipeline Tag | text-generation |
Widget Examples
- example1:
  - Input: <schema>CREATE TABLE radio(age VARCHAR, radio_id VARCHAR, frequency VARCHAR, wavelength VARCHAR); CREATE TABLE radio_faults(radio_id VARCHAR, fault_description VARCHAR)</schema><question>Get the radio id and defect descriptions of radios that have wavelength greater than 30?</question><sql>
- example2:
  - Input: <schema>CREATE TABLE system(JobID: String, GID: String, UID: String, Start: Time(yyyy/mm/dd), End: Time, ElapsedRaw: Time, CPUTimeRAW: Time, NCPUS: Number, NNodes: Number, NodeList: List, State: String, Timelimit: Time);</schema><question>Get UID and job id for Jobs that started on Jan 20, 2023</question><sql>
- example3:
  - Input: <schema>CREATE TABLE department (Department_ID number, Name text, Creation text, Ranking number, Budget_in_Billions number, Num_Employees number) which has Department_ID as primary key and CREATE TABLE head (head_ID number, name text, born_state text, age number) which has head_ID as primary key and CREATE TABLE management (department_ID number, head_ID number, temporary_acting text) which has department_ID as primary key</schema><question>
🔧 Technical Details
The model is trained with a unique pipeline that alternates between two objectives: maximizing the log probability of all tokens in the sequence (including the prompt tokens), and minimizing the difference between the true value and the predicted maximum value of the output tokens, i.e., the generated tokens for the SQL-query slice of the sequence.
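Written out, and assuming a training sequence $x_1,\dots,x_T$ that concatenates the prompt with the gold SQL query (with the SQL slice starting at index $s$), the two objectives can be read as:

$$
\mathcal{L}_{1} = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}_{2} = \sum_{t=s}^{T} d\!\left(x_t,\ \operatorname*{arg\,max}_{v}\, p_\theta\!\left(v \mid x_{<t}\right)\right),
$$

where $d(\cdot,\cdot)$ measures the discrepancy between the true token and the model's highest-probability prediction. Training alternates between minimizing $\mathcal{L}_1$ (equivalently, maximizing the log probability of every token, prompt included) and minimizing $\mathcal{L}_2$. The exact form of $d$ is not specified in the card, so this formalization is a reading of the prose rather than the published objective.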
📄 License
The model's new weights, along with all other assets involved with it, are open-sourced under the MIT license.
👥 The PipableAI Team
Avi Kothari, Pratham Gupta, Ritvik Aryan Kalra, Rohan Bhatial, Soham Acharya