🚀 ChatDB的Natural - SQL - 7B模型
Natural - SQL - 7B是一款在文本轉SQL指令方面表現極為出色的模型,它能夠出色地理解複雜問題,在同規模模型中脫穎而出。
ChatDB.ai | Notebook | Twitter

🚀 快速開始
安裝依賴
確保你安裝了正確版本的transformers庫:
pip install transformers==4.35.2
加載模型
使用以下Python代碼加載模型:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("chatdb/natural-sql-7b")
model = AutoModelForCausalLM.from_pretrained(
"chatdb/natural-sql-7b",
device_map="auto",
torch_dtype=torch.float16,
)
生成SQL
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model.generate(
**inputs,
num_return_sequences=1,
eos_token_id=100001,
pad_token_id=100001,
max_new_tokens=400,
do_sample=False,
num_beams=1,
)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs[0].split("```sql")[-1])
✨ 主要特性
- 在文本轉SQL指令方面表現強勁,能出色理解複雜問題。
- 可以處理其他模型通常難以應對的複雜複合問題。
📦 安裝指南
安裝正確版本的transformers庫:
pip install transformers==4.35.2
💻 使用示例
基礎用法
加載模型
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("chatdb/natural-sql-7b")
model = AutoModelForCausalLM.from_pretrained(
"chatdb/natural-sql-7b",
device_map="auto",
torch_dtype=torch.float16,
)
生成SQL
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model.generate(
**inputs,
num_return_sequences=1,
eos_token_id=100001,
pad_token_id=100001,
max_new_tokens=400,
do_sample=False,
num_beams=1,
)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs[0].split("```sql")[-1])
提示模板
# Task
Generate a SQL query to answer the following question: `{自然語言問題}`
### PostgreSQL Database Schema
The query will run on a database with the following schema:
<SQL Table DDL Statements>
# SQL
Here is the SQL query that answers the question: `{自然語言問題}`
'''sql
示例SQL輸出
示例模式
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
username VARCHAR(50) NOT NULL,
email VARCHAR(100) NOT NULL,
password_hash TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE projects (
project_id SERIAL PRIMARY KEY,
project_name VARCHAR(100) NOT NULL,
description TEXT,
start_date DATE,
end_date DATE,
owner_id INTEGER REFERENCES users(user_id)
);
CREATE TABLE tasks (
task_id SERIAL PRIMARY KEY,
task_name VARCHAR(100) NOT NULL,
description TEXT,
due_date DATE,
status VARCHAR(50),
project_id INTEGER REFERENCES projects(project_id)
);
CREATE TABLE taskassignments (
assignment_id SERIAL PRIMARY KEY,
task_id INTEGER REFERENCES tasks(task_id),
user_id INTEGER REFERENCES users(user_id),
assigned_date DATE NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE comments (
comment_id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
task_id INTEGER REFERENCES tasks(task_id),
user_id INTEGER REFERENCES users(user_id)
);
示例SQL輸出
問題:顯示用戶加入最多的日期
SELECT created_at::DATE AS day, COUNT(*) AS user_count
FROM users
GROUP BY day
ORDER BY user_count DESC
LIMIT 1;
問題:顯示擁有評論最多任務的項目
SELECT p.project_name, t.task_name, COUNT(c.comment_id) AS comment_count
FROM projects p
JOIN tasks t ON p.project_id = t.project_id
JOIN comments c ON t.task_id = c.task_id
GROUP BY p.project_name, t.task_name
ORDER BY comment_count DESC
LIMIT 1;
問題:擁有Gmail郵箱地址的用戶與沒有Gmail郵箱地址的用戶的比例是多少?
SELECT
SUM(CASE WHEN email ILIKE '%@gmail.com%' THEN 1 ELSE 0 END)::FLOAT / NULLIF(SUM(CASE WHEN email NOT ILIKE '%@gmail.com%' THEN 1 ELSE 0 END), 0) AS gmail_ratio
FROM
users;
📚 詳細文檔
基準測試
未訓練的新數據集上的結果(通過SQL - Eval)

非常感謝defog團隊開源了sql - eval👏
Natural - SQL還可以處理其他模型通常難以應對的複雜複合問題。這裡有一個更詳細的報告,一個小測試在這裡。
📄 許可證
模型權重遵循CC BY - SA 4.0
許可協議,並在原始模型Deepseek許可協議的基礎上擴展了負責任使用的額外指南。你可以自由使用和修改該模型,甚至用於商業目的。如果你修改了模型權重,例如通過微調,你必須在相同的CC BY - SA 4.0
許可協議下公開分享你的更改。
📋 模型信息
屬性 |
詳情 |
基礎模型 |
deepseek - ai/deepseek - coder - 6.7b - instruct |
標籤 |
instruct、finetune |
庫名稱 |
transformers |
許可證 |
cc - by - sa - 4.0 |
任務類型 |
文本生成 |