🚀 Jellyfish-13B
Jellyfish-13B is a large language model tailored for data preprocessing tasks. It handles error detection, data imputation, schema matching, and entity matching; its performance is competitive with previous state-of-the-art algorithms and large language models; and it supports local execution, keeping your data secure.
🚀 Quick Start
We have built Jellyfish-7B and Jellyfish-8B, two lightweight versions of Jellyfish. They retain strong data-processing performance while offering faster inference and stronger reasoning ability!
😄 We strongly recommend the 7B and 8B models for their excellent generalization and reasoning on unseen tasks!
✨ Key Features
- Task-focused: fine-tuned specifically for data preprocessing tasks, including error detection, data imputation, schema matching, and entity matching.
- Competitive performance: on par with previous state-of-the-art algorithms and large language models such as OpenAI's GPT-3.5 and GPT-4.
- Secure and economical: as a 13-billion-parameter model, Jellyfish supports low-cost local execution without compromising data security.
- Multiple versions: released as Jellyfish-13B (main branch) and Jellyfish-13B-Interpreter (alternative branch), targeting different application scenarios.
📚 Documentation
Model Details
Jellyfish-13B is a large language model with 13 billion parameters. We fine-tuned the Open-Orca/OpenOrca-Platypus2-13B model on datasets related to data preprocessing tasks. Its performance is competitive with previous state-of-the-art algorithms and large language models such as OpenAI's GPT-3.5 and GPT-4, as shown in our earlier studies. Notably, as a 13B model, Jellyfish enables cost-effective local execution without compromising data security. Moreover, its proficiency in data preprocessing does not come at the expense of general ability: Jellyfish remains a strong performer on NLP tasks, as evidenced by a comparison of NLP benchmark scores between Jellyfish and OpenOrca-Platypus2.
We release two distinct versions of Jellyfish: Jellyfish-13B (main branch) and Jellyfish-13B-Interpreter (alternative branch). As the names suggest, Jellyfish-13B is designed to give precise, direct answers. In contrast, Jellyfish-13B-Interpreter is fine-tuned for data preprocessing tasks on data that includes reasoning and chains of sequential thought, distilling knowledge from GPT-4.
The two versions target different scenarios. Jellyfish-13B is well suited to integration into larger data-management systems: its plain, concise responses are easy to convert into code within a data management/analysis pipeline. Jellyfish-13B-Interpreter, on the other hand, is more user-facing; its responses provide in-depth insight into the data without requiring advanced coding skills or a sophisticated grasp of statistics.
More details about the model can be found in the Jellyfish paper.
Attribute | Details |
---|---|
Developed by | Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada |
Contact | dongyuyang@nec.com |
Funded by | NEC Corporation, Osaka University |
Language | English |
License | Non-commercial Creative Commons license (CC BY-NC-4.0) |
Fine-tuned from | Open-Orca/OpenOrca-Platypus2-13B |
Citation
If you find our work useful, please give us credit with the following citation:
```bibtex
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```
Performance on Seen Tasks
Error Detection, Data Imputation, Schema Matching, and Entity Matching
Task | Type | Dataset | Best of non-LLM methods¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|---|---|
Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
Average | | | 80.44 | - | 84.17 | 72.58 | - | 82.74 | 81.55 | 86.02 |
For GPT-3.5 and GPT-4, we used few-shot prompting on all datasets. For the Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets.
Accuracy is the metric for data imputation; F1 score is used for the other tasks.
1. The best non-LLM methods:
- HoloDetect for seen datasets in error detection
- RAHA for unseen datasets in error detection
- IPM for data imputation
- SMAT for schema matching
- Ditto for entity matching
2. Large Language Models as Data Preprocessors
Column Type Annotation
Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|
SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |
Few-shot prompting is disabled for the Jellyfish models.
Attribute Value Extraction
Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|
AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |
Few-shot prompting is disabled for the Jellyfish models.
Prompt Template

```
### Instruction:

<prompt> (without the <>)

### Response:
```
Training Details
Training Data
We used the training and validation sets from the paper Can Foundation Models Wrangle Your Data? to fine-tune Jellyfish. The raw datasets come from HazyResearch/fm_data_tasks, RAHA, SMAT, and IPM. On top of these datasets, we built an instruction-tuning dataset for fine-tuning the LLM, in a style similar to the OpenOrca dataset.
Training Method
We used LoRA to speed up training, targeting the q_proj, k_proj, v_proj, and o_proj modules.
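As a sketch of this setup, a LoRA configuration built with the peft library might look like the following. The rank, alpha, and dropout values are illustrative assumptions, not the authors' published hyperparameters; only the target modules are taken from the description above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA configuration targeting the same attention projections as described above.
lora_config = LoraConfig(
    r=16,              # rank of the low-rank update (assumed value)
    lora_alpha=32,     # scaling factor (assumed value)
    lora_dropout=0.05, # assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the LoRA adapter weights are trainable.
base_model = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```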
Usage
To speed up inference, we strongly recommend running Jellyfish with vLLM.
Python Scripts
We provide two simple Python examples of running inference with the Jellyfish models.
Using the Transformers and Torch Modules

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The model is downloaded from the Hugging Face hub automatically if it is not cached.
# By default, model files are cached under "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/".
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test.
user_message = "Hello, world."

prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can adjust the sampling parameters as needed.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )
# Decode only the newly generated tokens, skipping the prompt.
output = generation_output.sequences
response = tokenizer.decode(
    output[:, input_ids.shape[-1]:][0], skip_special_tokens=True
).strip()
print(response)
```
Using vLLM

```python
from vllm import LLM, SamplingParams

# To run inference with vLLM, download the model files from the Hugging Face hub
# (or manually), and adjust the model path to your local environment.
path_to_model = "/workspace/models/Jellyfish"
model = LLM(model=path_to_model)

# You can adjust the sampling parameters as needed.
# Note: the stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["### Instruction:"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test.
user_message = "Hello, world."

prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```
Prompts
We provide the prompts used for fine-tuning and inference. You can structure your data according to them. We also encourage experimenting with different prompts, which may yield better generation quality.
Jellyfish-13B
Error Detection
The error detection task comes in two forms. In the first, the full record row is provided and the task is to determine whether a specific value is erroneous. In the second, only the value of a specific attribute is given, and its correctness is judged from the attribute's name and value alone. The prompt examples below correspond to these two forms, respectively.
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```

```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
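The second (attribute-only) template can be filled programmatically. The helper below is an illustrative sketch; the function name and the sample record are ours, not part of the released code.

```python
def build_error_detection_prompt(keyword, attributes, attr, value):
    """Fill the attribute-only error-detection template for Jellyfish-13B."""
    return (
        "Your task is to determine if there is an error in the value of a specific attribute.\n"
        f"The attributes may belong to a {keyword} record and could be one of the following: "
        + ", ".join(attributes)
        + ".\n"
        "Errors can include, but are not limited to, spelling errors, inconsistencies, "
        "or values that don't make sense for that attribute.\n"
        'Note: Missing values (N/A or "nan") are not considered errors.\n'
        f"Attribute for Verification: [{attr}: {value}]\n"
        f"Question: Is there an error in the value of {attr}? "
        "Choose your answer from: [Yes, No]."
    )

# Example: check a suspicious city value in a hospital record.
prompt = build_error_detection_prompt(
    "hospital", ["name", "city", "phone"], "city", "birmingnam"
)
```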
Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
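Building this prompt from a record dictionary can be sketched as follows; the helper and the sample record are illustrative, not part of the released code.

```python
def build_imputation_prompt(keyword, record, missing_attr):
    """Fill the data-imputation template from a dict of known attribute values."""
    fields = ", ".join(f"{k}: {v}" for k, v in record.items())
    known = ", ".join(record.keys())
    return (
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing_attr}.\n"
        f"Your task is to deduce or infer the value of {missing_attr} using the available information in the record.\n"
        f"You may be provided with fields like {known} to help you in the inference.\n"
        f"Record: [{fields}]\n"
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing_attr}?\n"
        f"Answer only the value of {missing_attr}."
    )

# Example: infer the missing city of a restaurant record.
record = {"name": "le montrachet bistro", "addr": "3000 paradise rd.", "phone": "702-732-5651"}
prompt = build_imputation_prompt("restaurant", record, "city")
```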
Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
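Filling the schema-matching template for a pair of columns might look like this; the helper name and the sample attributes are illustrative assumptions.

```python
def build_schema_matching_prompt(attr_a, attr_b):
    """attr_a / attr_b are (name, description) pairs for the two columns."""
    return (
        "Your task is to determine if the two attributes (columns) are semantically "
        "equivalent in the context of merging two tables.\n"
        "Each attribute will be provided by its name and a brief description.\n"
        "Your goal is to assess if they refer to the same information based on these "
        "names and descriptions provided.\n"
        f"Attribute A is [name: {attr_a[0]}, description: {attr_a[1]}].\n"
        f"Attribute B is [name: {attr_b[0]}, description: {attr_b[1]}].\n"
        "Are Attribute A and Attribute B semantically equivalent? "
        "Choose your answer from: [Yes, No]."
    )

# Example: two columns that plausibly describe the same information.
prompt = build_schema_matching_prompt(
    ("birth_date", "date of birth of the patient"),
    ("dob", "patient date of birth"),
)
```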
Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```
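Because Jellyfish-13B answers with a bare Yes/No, the entity-matching template pairs naturally with a small answer parser when plugged into a pipeline. Both helpers below are illustrative sketches, not part of the released code.

```python
def build_entity_matching_prompt(record_a, record_b):
    """Fill the entity-matching template from two record dicts with shared keys."""
    fmt = lambda r: ", ".join(f"{k}: {v}" for k, v in r.items())
    attrs = ", ".join(record_a.keys())
    return (
        "You are tasked with determining whether two records listed below are the same "
        "based on the information provided.\n"
        f"Carefully compare the {attrs} for each record before making your decision.\n"
        'Note: Missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: [{fmt(record_a)}]\n"
        f"Record B: [{fmt(record_b)}]\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )

def parse_yes_no(response):
    """Map the model's final Yes/No token to a boolean. This also covers
    Jellyfish-13B-Interpreter, which ends its reasoning with a lone Yes/No."""
    return response.strip().split()[-1].strip(".").lower() == "yes"

# Example: compare two product records.
a = {"name": "iphone 13", "price": "799"}
b = {"name": "apple iphone 13", "price": "799.0"}
prompt = build_entity_matching_prompt(a, b)
```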
Column Type Annotation
We follow the (text+inst+2-step) prompt from Column Type Annotation using ChatGPT.
Attribute Value Extraction
We follow the (textual, w/o examples) prompt from Product Attribute Value Extraction using Large Language Models.
Jellyfish-13B-Interpreter
Error Detection
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```

```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record.
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer.
Your final answer should only consist of the value of {attribute X}.
```
Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
Entity Matching
```
You are tasked with determining whether two products listed below are the same based on the information provided.
Carefully examine all the attributes before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
```
Sample Responses from Jellyfish-13B-Interpreter
We provide a few sample responses from Jellyfish-13B-Interpreter to showcase its performance.
For readability, we show the raw data records instead of the entire prompts.
-
Product A: [name: \"adobe creative suite cs3 web standard [ mac ]\", manufacturer: \"adobe\", price: \"999.0\"] Product B: [name: \"adobe creative suite 3 ( cs3 ) web standard 3 windows upsell from dreamweaver or flash\", price: \"799.0\"] Product A and Product B are not the same. Although both products are Adobe Creative Suite CS3 Web Standard, they are for different operating systems. Product A is for Mac and Product B is for Windows. This difference in operating systems indicates that the products are not the same, as they are designed for different hardware and software environments. Additionally, Product B mentions that it is an upsell from Dreamweaver or Flash, which further differentiates it from Product A. No
-
Article A: [title: \"l/mrp : a buffer management strategy for interactive continuous data flows in a multimedia dbms\", authors: \"f moser , a kraiss , w klas\", venue: \"vldb\", year: \"1995\"] Article B: [title: \"l/mrp : a buffer management strategy for interactive continuous data flows in a multimedia dbms\", authors: \"f moser , w klas\", venue: \"proceedings of the international conference on very large & hellip ; ,\", year: \"1995.0\"] The titles of both Article A and Article B are identical, which is a strong indicator that they are the same article. The authors of both articles are also the same, with the exception that Article B does not include the author \"a kraiss\". However, this could be due to a mistake in data entry or a missing author in Article B. The venue for both articles is also similar, with Article A listed as \"vldb\" and Article B listed as \"proceedings of the international conference on very large & hellip ; ,\" which is likely the same conference. The year of publication is also the same for both articles, with Article A listed as \"1995\" and Article B listed as \"1995.0\". Given these similarities, it is reasonable to conclude that Article A and Article B are the same. Yes
-
Record: [name: "le montrachet bistro", addr: "3000 paradise rd.", phone: "702-732-5651", type: "french bistro"]. missing attribute: "City" The restaurant record provides the name, address, phone number, and type of restaurant. The name and type do not provide any specific information about the city. However, the address and phone number can be used to infer the city. The address, "3000 paradise rd.", could potentially be in many different cities. However, the phone number, "702-732-5651", provides a clue. The area code, 702, is used in the state of Nevada, specifically in the Las Vegas metropolitan area. Therefore, based on the area code of the phone number, it can be inferred that the restaurant is located in Las Vegas. Las Vegas
🔧 Technical Details
Training Data
The training and validation sets from the paper Can Foundation Models Wrangle Your Data? were used; the raw datasets come from several open-source projects.
Training Method
LoRA was used to speed up training, targeting the q_proj, k_proj, v_proj, and o_proj modules.
📄 License
This project is released under the non-commercial Creative Commons license (CC BY-NC-4.0).



