🚀 Jellyfish-13B
Jellyfish-13B is a large language model tailored for data preprocessing tasks. It performs error detection, data imputation, schema matching, and entity matching, delivers performance competitive with prior state-of-the-art algorithms and large language models, and supports local execution to keep your data secure.
🚀 Quick Start
We have built Jellyfish-7B and Jellyfish-8B, two lightweight versions of Jellyfish. They maintain excellent data processing performance while offering faster inference and stronger reasoning ability!
😄 We strongly recommend using the 7B and 8B models, as they generalize and reason remarkably well on unseen tasks!
✨ Key Features
- Task-focused: fine-tuned specifically for data preprocessing tasks, including error detection, data imputation, schema matching, and entity matching.
- Strong performance: competitive with prior state-of-the-art algorithms and large language models such as OpenAI's GPT-3.5 and GPT-4.
- Secure and economical: as a 13-billion-parameter model, Jellyfish runs locally at low cost without compromising data security.
- Two variants: released as Jellyfish-13B (main branch) and Jellyfish-13B-Interpreter (alternate branch), targeting different application scenarios.
📚 Documentation
Model Details
Jellyfish-13B is a large language model with 13 billion parameters. We fine-tuned the Open-Orca/OpenOrca-Platypus2-13B model on datasets related to data preprocessing tasks. Its performance is competitive with prior state-of-the-art algorithms and large language models such as OpenAI's GPT-3.5 and GPT-4, as shown in our earlier studies. Notably, as a 13B model, Jellyfish enables cost-effective local execution without compromising data security. Moreover, its proficiency in data preprocessing does not come at the expense of general ability: Jellyfish remains a strong language model on NLP tasks, as demonstrated by the comparison of NLP benchmark scores between Jellyfish and OpenOrca-Platypus2.
We release two distinct variants of Jellyfish: Jellyfish-13B (main branch) and Jellyfish-13B-Interpreter (alternate branch). As the names suggest, Jellyfish-13B is designed to give precise, straightforward answers. In contrast, Jellyfish-13B-Interpreter is fine-tuned on data for data preprocessing tasks that includes reasoning and step-by-step thought processes distilled from GPT-4.
The two variants target different scenarios. Jellyfish-13B is suited for integration into larger data management systems, since its concise responses are easy to convert into code within a data management/analysis pipeline. Jellyfish-13B-Interpreter, on the other hand, is more user-facing: its responses provide in-depth insights into the data without requiring advanced coding skills or a deep understanding of statistics.
More details about the model can be found in the Jellyfish paper.
Property | Details |
---|---|
Developers | Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada |
Contact | dongyuyang@nec.com |
Funded by | NEC Corporation, Osaka University |
Language | English |
License | Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC-4.0) |
Fine-tuned from | Open-Orca/OpenOrca-Platypus2-13B |
Citation
If you find our work useful, please give us credit by citing:
@article{zhang2023jellyfish,
title={Jellyfish: A Large Language Model for Data Preprocessing},
author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
journal={arXiv preprint arXiv:2312.01678},
year={2023}
}
Performance on Data Preprocessing Tasks
Error detection, data imputation, schema matching, and entity matching
Task | Type | Dataset | Best non-LLM method¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|---|---|
Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
Average | | | 80.44 | -- | 84.17 | 72.58 | -- | 82.74 | 81.55 | 86.02 |
For GPT-3.5 and GPT-4, we used few-shot prompting on all datasets. For the Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets.
Accuracy is used as the metric for data imputation tasks, and the F1 score for all other tasks (a minimal sketch of both metrics is given after the footnotes below).
¹ Best non-LLM methods:
- HoloDetect for error detection on seen datasets
- RAHA for error detection on unseen datasets
- IPM for data imputation
- SMAT for schema matching
- Ditto for entity matching

² Large Language Models as Data Preprocessors
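As a quick reference for how the scores above are computed, here is a minimal Python sketch of the two metrics (accuracy for data imputation, binary F1 for the other tasks); the function names and the "Yes"/"No" convention are illustrative, not taken from the Jellyfish evaluation code.

def accuracy(preds, labels):
    # Fraction of exactly correct imputations.
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def f1_score(preds, labels, positive="Yes"):
    # Binary F1 over "Yes"/"No" answers, as used for detection/matching tasks.
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0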
Column Type Annotation
Dataset | RoBERTa (159-shot)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|
SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |
Few-shot prompting is disabled for the Jellyfish models.
Attribute Value Extraction
Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|
AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |
Few-shot prompting is disabled for the Jellyfish models.
Prompt Template
### Instruction:
<prompt> (without the <>)
### Response:
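For clarity, a minimal sketch of assembling the full prompt string in Python, following the template above and the system message used in the usage examples below (the helper name build_prompt is ours, not part of any released API):

def build_prompt(user_message: str) -> str:
    # System message used in the usage examples below.
    system_message = (
        "You are an AI assistant that follows instruction extremely well. "
        "Help as much as you can."
    )
    # Template: system message, then "### Instruction:", the task prompt, then "### Response:".
    return f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"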
Training Details
Training Data
We used the training and validation sets from the paper Can Foundation Models Wrangle Your Data? to fine-tune Jellyfish. The original datasets come from HazyResearch/fm_data_tasks, RAHA, SMAT, and IPM. Based on these datasets, we constructed an instruction-tuning dataset for fine-tuning the large language model, in a style similar to the OpenOrca dataset.
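Purely for illustration, an OpenOrca-style instruction-tuning record pairs a system prompt and a task instruction with the target response. A record built from the entity-matching template below might look roughly like this; the field names follow the public OpenOrca dataset, and the content is a made-up example, not taken from the released training data:

example_record = {
    "system_prompt": "You are an AI assistant that follows instruction extremely well. Help as much as you can.",
    "question": (
        "You are tasked with determining whether two records listed below are the same "
        "based on the information provided. ...\n"
        "Record A: [name: iphone 13, price: 799]\n"
        "Record B: [name: apple iphone 13, price: 799.0]\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    ),
    "response": "Yes",
}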
Training Method
We use LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.
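A minimal sketch of such a LoRA setup with the peft library is shown below; only the target modules come from the description above, while the rank, alpha, and dropout values are illustrative assumptions rather than the actual training hyperparameters.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")
lora_config = LoraConfig(
    r=16,                      # rank: illustrative value, not the published setting
    lora_alpha=32,             # scaling factor: illustrative value
    lora_dropout=0.05,         # illustrative value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # modules named above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()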
Usage
For faster inference, we strongly recommend running Jellyfish with vLLM.
Python Scripts
We provide two simple Python examples for running inference with the Jellyfish models.
Using the Transformers and Torch modules
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch
if torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
# If the model is not cached, it will be downloaded automatically from the Hugging Face Hub.
# By default, the model files are cached under "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/".
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
"NECOUDBFM/Jellyfish",
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")
system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."
# You need to define the user_message variable based on the task and the data you want to test.
user_message = "Hello, world."
prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
# You can modify the sampling parameters as needed.
generation_config = GenerationConfig(
do_sample=True,
temperature=0.35,
top_p=0.9,
)
with torch.no_grad():
generation_output = model.generate(
input_ids=input_ids,
generation_config=generation_config,
return_dict_in_generate=True,
output_scores=True,
max_new_tokens=1024,
pad_token_id=tokenizer.eos_token_id,
repetition_penalty=1.15,
)
output = generation_output[0]
response = tokenizer.decode(
output[:, input_ids.shape[-1] :][0], skip_special_tokens=True
).strip()
print(response)
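As a usage note, user_message is simply one of the task prompts from the Prompts section below, filled in with your data. A minimal sketch for an entity-matching query (the record values are taken from the example responses further down and are for illustration only):

user_message = (
    "You are tasked with determining whether two records listed below are the same "
    "based on the information provided. Carefully compare the name, price for each "
    "record before making your decision.\n"
    'Note: Missing values (N/A or "nan") should not be used as a basis for your decision.\n'
    "Record A: [name: adobe creative suite cs3 web standard, price: 999.0]\n"
    "Record B: [name: adobe creative suite 3 ( cs3 ) web standard, price: 799.0]\n"
    "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
)
# Plug it into the prompt template and generate exactly as above.
prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"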
Using vLLM
from vllm import LLM, SamplingParams
# To run inference with vLLM, download the model files from the Hugging Face Hub or manually.
# Adjust the model path below to match your local environment.
path_to_model = (
"/workspace/models/Jellyfish"
)
model = LLM(model=path_to_model)
# You can modify the sampling parameters as needed.
# Note: the stop parameter should not be changed.
sampling_params = SamplingParams(
temperature=0.35,
top_p=0.9,
max_tokens=1024,
stop=["### Instruction:"],
)
system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."
# You need to define the user_message variable based on the task and the data you want to test.
user_message = "Hello, world."
prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
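Since vLLM's LLM.generate also accepts a list of prompts, batching many records into one call is straightforward. A minimal sketch, reusing the model, system_message, and sampling_params defined above; the user_messages list is a hypothetical placeholder:

user_messages = ["...task prompt for record 1...", "...task prompt for record 2..."]
prompts = [
    f"{system_message}\n\n### Instruction:\n\n{m}\n\n### Response:\n\n" for m in user_messages
]
# One call schedules all prompts; vLLM batches them internally for throughput.
outputs = model.generate(prompts, sampling_params)
responses = [o.outputs[0].text.strip() for o in outputs]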
Prompts
We provide the prompts used for fine-tuning and inference. You can structure your data according to these prompts. In addition, we encourage trying different prompts, which may yield the best generation quality.
Jellyfish-13B
Error Detection
The error detection task comes in two forms. In the first, the entire record row is provided, and the task is to determine whether the value of a specific attribute is erroneous. In the second, only the value of a specific attribute is given, and its correctness must be judged solely from the attribute's name and value. The prompt examples below correspond to these two forms, respectively.
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
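To connect these templates to the code examples above, here is a minimal sketch of filling the first error-detection form from a Python dict; serialize_record and the record contents are illustrative, not part of any released tooling:

def serialize_record(record: dict) -> str:
    # Render a dict as "[attr: value, attr: value, ...]" as in the templates above.
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

record = {"name": "le montrachet bistro", "addr": "3000 paradise rd.", "city": "las vegas"}
target_attr = "city"
user_message = (
    "Your task is to determine if there is an error in the value of a specific attribute "
    "within the whole record provided.\n"
    f"The attributes may include {', '.join(record)}.\n"
    "Errors may include, but are not limited to, spelling errors, inconsistencies, or values "
    "that don't make sense given the context of the whole record.\n"
    f"Record {serialize_record(record)}\n"
    f"Attribute for Verification: [{target_attr}: {record[target_attr]}]\n"
    f"Question: Is there an error in the value of {target_attr}? Choose your answer from: [Yes, No]."
)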
Data Imputation
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
Schema Matching
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
Entity Matching
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
Column Type Annotation
We follow the prompt (text+inst+2-step) from Column Type Annotation using ChatGPT.
Attribute Value Extraction
We follow the prompt (textual, w/o examples) from Product Attribute Value Extraction using Large Language Models.
Jellyfish-13B-Interpreter
Error Detection
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record.
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
Data Imputation
You are presented with a {keyword} record that is missing a specific attribute {attribute X}.
Your task is to deduce or infer the manufacturer of the product using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided product record, what would you infer is the value for the missing attribute {attribute X}?
After your reasoning, finish your response in a separate line with and ONLY with your final answer.
Your final answer should only consist of the value of {attribute X}.
Schema Matching
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
Entity Matching
You are tasked with determining whether two products listed below are the same based on the information provided.
Carefully examine all the attributes before making your decision.
Note: Missing values (N/A or "nan") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity?
After your reasoning, finish your response in a separate line with and ONLY with your final answer. Choose your final answer from [Yes, No].
Example Responses from Jellyfish-13B-Interpreter
We provide some example responses from Jellyfish-13B-Interpreter to showcase its performance.
For readability, we show the raw data records instead of the full prompts.
-
Product A: [name: "adobe creative suite cs3 web standard [ mac ]", manufacturer: "adobe", price: "999.0"] Product B: [name: "adobe creative suite 3 ( cs3 ) web standard 3 windows upsell from dreamweaver or flash", price: "799.0"] Product A and Product B are not the same. Although both products are Adobe Creative Suite CS3 Web Standard, they are for different operating systems. Product A is for Mac and Product B is for Windows. This difference in operating systems indicates that the products are not the same, as they are designed for different hardware and software environments. Additionally, Product B mentions that it is an upsell from Dreamweaver or Flash, which further differentiates it from Product A. No
-
Article A: [title: "l/mrp : a buffer management strategy for interactive continuous data flows in a multimedia dbms", authors: "f moser , a kraiss , w klas", venue: "vldb", year: "1995"] Article B: [title: "l/mrp : a buffer management strategy for interactive continuous data flows in a multimedia dbms", authors: "f moser , w klas", venue: "proceedings of the international conference on very large & hellip ; ,", year: "1995.0"] The titles of both Article A and Article B are identical, which is a strong indicator that they are the same article. The authors of both articles are also the same, with the exception that Article B does not include the author "a kraiss". However, this could be due to a mistake in data entry or a missing author in Article B. The venue for both articles is also similar, with Article A listed as "vldb" and Article B listed as "proceedings of the international conference on very large & hellip ; ," which is likely the same conference. The year of publication is also the same for both articles, with Article A listed as "1995" and Article B listed as "1995.0". Given these similarities, it is reasonable to conclude that Article A and Article B are the same. Yes
-
Record: [name: "le montrachet bistro", addr: "3000 paradise rd.", phone: "702-732-5651", type: "french bistro"]. missing attribute: "City" The restaurant record provides the name, address, phone number, and type of restaurant. The name and type do not provide any specific information about the city. However, the address and phone number can be used to infer the city. The address, "3000 paradise rd.", could potentially be in many different cities. However, the phone number, "702-732-5651", provides a clue. The area code, 702, is used in the state of Nevada, specifically in the Las Vegas metropolitan area. Therefore, based on the area code of the phone number, it can be inferred that the restaurant is located in Las Vegas. Las Vegas
🔧 Technical Details
Training Data
The training and validation sets from the paper Can Foundation Models Wrangle Your Data? are used; the original datasets come from several open-source projects.
Training Method
LoRA is used to speed up training, targeting the q_proj, k_proj, v_proj, and o_proj modules.
📄 License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC-4.0).



