T5-LM-Large-text2sql-spider open-source model - convert text into executable SQL queries for free

T5 LM Large Text2sql Spider

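Developed by gaussalgo

A text-to-SQL model fine-tuned from T5-large-LM-adapt that generates executable SQL queries by incorporating the database table schema into its input

Large language model

Transformers

English #structured-query-generation #database-aware #natural-language-to-SQL

Downloads: 2,124

Released: 4/25/2023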

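Model overview

Given a natural language question and a database table schema, the model generates a structured SQL query. It is intended for database querying scenarios.

Model features

Database schema integration

During training, the database table schema is appended to the input question, explicitly specifying the available columns and their relationships

Cross-database generalization

The model can handle database schemas that did not appear in the training data; the evaluation databases are disjoint from the training databases

Executable SQL generation

Because the schema is part of the input, the generated queries are intended to run directly against the target database, which reduces problems such as unknown column names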

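Model capabilities

Natural language to SQL conversion

Database query generation

Structured data access

Use cases

Database querying

Musician information lookup

Query the average, minimum, and maximum age of musicians by nationality

Generated SQL: SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'

Data report generation

Statistical report generation

Generate SQL queries for statistical reports from natural language descriptions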

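🚀 T5 large LM adapt for text to SQL

This model generates structured SQL queries from natural language prompts. During training, the database schema is included in the input alongside the question, so the model can take the structure of a specific database into account and generate queries that apply to it.

🚀 Quick start

The model targets the text-to-SQL task: given a natural language question, it generates the corresponding SQL query. Because the database schema is added to the input question during training, the model learns a mapping from schema to expected output, which helps it generalize to schemas not present in the training data.

✨ Key features

Schema-aware input: the database schema is included in the input question during training, so the model can account for the structure of a specific database and generate queries that apply to it.
Better generalization: by learning the mapping between schema and expected output, the model generalizes better to schemas not present in the training data.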

💻 Usage examples

Basic usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = 'gaussalgo/T5-LM-Large-text2sql-spider'
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = "What is the average, minimum, and maximum age for all French musicians?"
schema = """
   "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int , "Average" int , foreign_key:  primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text , "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool , foreign_key:  primary key: "Singer_ID" [SEP] "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium" "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert"  foreign_key: "concert_ID" int from "concert" "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"
"""

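# Build the prompt in the same "Question: ... Schema: ..." layout used during training.
input_text = " ".join(["Question: ",question, "Schema:", schema])

# Tokenize the prompt and generate the SQL query (at most 512 tokens).
model_inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**model_inputs, max_length=512)

# batch_decode returns a list with one string per generated sequence.
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("SQL Query:")
print(output_text[0])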

Output:

SQL Query:
SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'
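
When several questions target the same database, the prompt construction and generation steps from the basic usage example can be wrapped in small helpers. The following is a minimal sketch, not part of the original model card: build_prompt and generate_sql are hypothetical helper names, and batching the prompts with padding=True is an assumption about reasonable usage rather than something the authors describe.

```python
def build_prompt(question, schema):
    """Lay out question and schema the same way as the basic usage example."""
    return " ".join(["Question: ", question, "Schema:", schema])


def generate_sql(model, tokenizer, questions, schema, max_length=512):
    """Generate one SQL string per question for a single database schema.

    `model` and `tokenizer` are the objects loaded in the basic usage example.
    """
    prompts = [build_prompt(q, schema) for q in questions]
    # padding=True pads shorter prompts so the batch forms one rectangular tensor.
    model_inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    outputs = model.generate(**model_inputs, max_length=max_length)
    # batch_decode returns a list with one decoded string per prompt.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the model, tokenizer, and schema from the basic usage example, a call such as generate_sql(model, tokenizer, [question], schema) would return a list holding one query string per question.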

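📚 Detailed documentation

Dataset

The model was fine-tuned on the training splits of the Spider and Spider-Syn datasets. In addition to the question itself, the input includes the database schema, so that the model can generate a query for the given database.

Example input prompt: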

Question:  What is the average, minimum, and maximum age for all French musicians?
Schema: "stadium" "Stadium_ID" int , "Location" text , "Name" text , "Capacity" int , "Highest" int , "Lowest" int ,
        "Average" int , foreign_key:  primary key: "Stadium_ID" [SEP] "singer" "Singer_ID" int , "Name" text , "Country" text ,
        "Song_Name" text , "Song_release_year" text , "Age" int , "Is_male" bool ,
        foreign_key:  primary key: "Singer_ID" [SEP]
        "concert" "concert_ID" int , "concert_Name" text , "Theme" text , "Year" text , foreign_key: "Stadium_ID" text from "stadium"
        "Stadium_ID" , primary key: "concert_ID" [SEP] "singer_in_concert"
        foreign_key: "concert_ID" int from "concert"
        "concert_ID" , "Singer_ID" text from "singer" "Singer_ID" , primary key: "concert_ID" "Singer_ID"

Example expected output:

SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'

Database schema format

The model was trained with the following standardized database schema format:

table_name column1_name column1_type column2_name column2_type ... foreign_key: FK_name FK_type from table_name column_name primary key: column_name [SEP]
table_name2 ...
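
To make the format above concrete, the following sketch serializes table descriptions into a schema string. It is an illustration, not the authors' preprocessing code: serialize_table, serialize_schema, and the tuple layout of their arguments are hypothetical, and the exact spacing is inferred from the schema string in the basic usage example.

```python
def serialize_table(name, columns, foreign_keys, primary_key):
    """Serialize one table into the schema format.

    columns:      list of (column_name, column_type), excluding foreign-key columns
    foreign_keys: list of (column_name, column_type, ref_table, ref_column)
    primary_key:  list of column names
    """
    column_part = " , ".join(f'"{col}" {typ}' for col, typ in columns)
    fk_part = " , ".join(
        f'"{col}" {typ} from "{ref_table}" "{ref_col}"'
        for col, typ, ref_table, ref_col in foreign_keys
    )
    pk_part = " ".join(f'"{col}"' for col in primary_key)
    fields = [
        f'"{name}"',
        f"{column_part} ," if column_part else "",
        "foreign_key:",
        f"{fk_part} ," if fk_part else "",
        "primary key:",
        pk_part,
    ]
    # Empty fields are kept on purpose: joining them reproduces the double
    # spaces that appear in the example schema string.
    return " ".join(fields)


def serialize_schema(tables):
    """Join the serialized tables with the [SEP] separator."""
    return " [SEP] ".join(serialize_table(*table) for table in tables)


# Two tables taken from the example schema in the basic usage section.
tables = [
    ("concert",
     [("concert_ID", "int"), ("concert_Name", "text"), ("Theme", "text"), ("Year", "text")],
     [("Stadium_ID", "text", "stadium", "Stadium_ID")],
     ["concert_ID"]),
    ("singer_in_concert",
     [],
     [("concert_ID", "int", "concert", "concert_ID"),
      ("Singer_ID", "text", "singer", "Singer_ID")],
     ["concert_ID", "Singer_ID"]),
]
schema = serialize_schema(tables)
print(schema)
```

The card does not say whether the model is sensitive to small whitespace differences, so the sketch simply mirrors the example string as closely as possible.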

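Evaluation

Evaluation was carried out on the development splits of the Spider and Spider-Syn datasets. The databases in the development split do not overlap with those in the training split, so the model never saw the evaluation databases during training. A prediction is scored by running both the generated query and the reference query against the database and comparing the results. The Spider and Spider-Syn development splits each contain 1032 samples.

Spider dev accuracy: 49.2%
Spider-Syn dev accuracy: 39.5%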
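
The result-comparison step described above can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not the authors' evaluation script: execution_match is a hypothetical helper, rows are compared as unordered multisets, and a query that fails to execute counts as a miss. The toy singer table and its rows are invented for the example.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, reference_sql):
    """Return True if both queries yield the same rows on the given database."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # For example, the generated query references an unknown column.
        return False
    reference_rows = conn.execute(reference_sql).fetchall()
    # Assumption: row order is ignored, duplicate rows are significant.
    return Counter(predicted_rows) == Counter(reference_rows)


# Toy database with invented rows, only to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (Singer_ID int, Name text, Country text, Age int)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?, ?)",
    [(1, "A", "France", 30), (2, "B", "France", 40), (3, "C", "Spain", 50)],
)

reference = "SELECT avg(Age), min(Age), max(Age) FROM singer WHERE Country = 'France'"
good = "SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'"
bad = "SELECT avg(years) FROM singer WHERE country = 'France'"

# Accuracy is the share of predictions whose results match the reference.
results = [execution_match(conn, query, reference) for query in (good, bad)]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Comparing results instead of query text means that queries differing only in identifier casing or formatting, such as the first prediction here, still count as correct.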

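Training

The model was trained with version 0.2.1 of the Adaptor library on the training splits of the Spider and Spider-Syn datasets, using the following parameters: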

training_arguments = AdaptationArguments(output_dir="train_dir",
                                         learning_rate=5e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         stopping_patience=8,
                                         save_total_limit=8,
                                         do_train=True,
                                         do_eval=True,
                                         bf16=True,
                                         warmup_steps=1000,
                                         gradient_accumulation_steps=8,
                                         logging_steps=10,
                                         eval_steps=200,
                                         save_steps=1000,
                                         num_train_epochs=10,
                                         evaluation_strategy="steps")

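The training procedure is relatively easy to reproduce, but we do not wish to publish the modified copies of the Spider datasets that it depends on. If you would like to look into it further, please contact us by opening a new PR or by emailing stefanik(at)gaussalgo.com.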

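🔧 Technical details

The model was fine-tuned from the t5-large-LM-adapt checkpoint. Text-to-SQL models that see only the natural language question can generate queries that reference unknown columns or otherwise ignore the schema of the specific database. Our approach includes the database schema in the input question during training, so the model learns a mapping from schema to expected output and generalizes better to schemas not present in the training data.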