🚀 Jellyfish-7B
Jellyfish-7B is a large language model with 7 billion parameters. It was fine-tuned from a base model on a dedicated dataset, performs well on data preprocessing tasks such as error detection and data imputation, and can effectively support data processing work.
🚀 Quick Start
To speed up inference, we strongly recommend running the Jellyfish models with vLLM. Below are two simple Python code examples for running inference with the Jellyfish model:
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Model will be automatically downloaded from HuggingFace model hub if not cached.
# Model files will be cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\INST]]"

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )
output = generation_output[0]

response = tokenizer.decode(
    output[:, input_ids.shape[-1]:][0], skip_special_tokens=True
).strip()
print(response)
```
Advanced Usage
```python
from vllm import LLM, SamplingParams

# To use vllm for inference, you need to download the model files either using HuggingFace model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = "/workspace/models/Jellyfish"

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: The stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\INST]]"

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```
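vLLM also accepts a list of prompts in a single generate call, which is where most of its throughput advantage comes from when processing many records. Below is a minimal sketch of batched inference, assuming the same model path and prompt template as above; the user messages are placeholders.

```python
from vllm import LLM, SamplingParams

# Same assumed local path as in the example above.
model = LLM(model="/workspace/models/Jellyfish")

sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# Placeholder task inputs; in practice these would be prompts built from your data.
user_messages = ["Hello, world.", "How are you today?"]
prompts = [f"{system_message}\n\n[INST]:\n\n{m}\n\n[\INST]]" for m in user_messages]

# A single generate call processes all prompts as one batch.
outputs = model.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text.strip())
```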
✨ Key Features
- High performance: delivers strong results on a range of data preprocessing tasks (error detection, data imputation, schema matching, entity matching, etc.), outperforming models such as GPT-3.5 on several of them.
- Strong interpretability: in a head-to-head comparison with GPT-3.5-turbo, Jellyfish-7B achieves a winning rate of 56.36% (evaluated by GPT-4).
- Multi-task support: covers a variety of data preprocessing tasks, including error detection, data imputation, schema matching, entity matching, column type annotation, and attribute value extraction.
📦 Installation
No dedicated installation steps are documented; refer to the model-loading code in the examples above. With the transformers library, the model is downloaded automatically from the HuggingFace model hub if it is not already cached; with vllm, adjust the model path to match your local environment.
📚 Documentation
Model Details
Jellyfish-7B is a large language model with 7 billion parameters. We fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on a subset of the Jellyfish-Instruct dataset.
More details about the model can be found in the Jellyfish paper.
Property | Details |
---|---|
Developed by | Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada |
Contact | dongyuyang@nec.com |
Funded by | NEC Corporation, Osaka University |
Language(s) | English |
License | Non-Commercial Creative Commons license (CC BY-NC-4.0) |
Finetuned from model | mistralai/Mistral-7B-Instruct-v0.2 |
Citation
If you find our work useful, please cite it as follows:
```bibtex
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```
Performance on seen tasks
Task | Type | Dataset | Best Non-LLM Method¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|---|---|
Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
Average | | | 80.44 | -- | 84.17 | 72.58 | -- | 82.74 | 81.55 | 86.02 |
For GPT-3.5 and GPT-4, we used few-shot prompting on all datasets. For the Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets.
Accuracy is used as the metric for data imputation; F1 score is used for the other tasks.
1. Best non-LLM methods:
   - HoloDetect for Error Detection seen datasets
   - RAHA for Error Detection unseen datasets
   - IPM for Data Imputation
   - SMAT for Schema Matching
   - Ditto for Entity Matching
2. Large Language Models as Data Preprocessors
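As a reference for the metrics above (accuracy for data imputation, F1 score elsewhere), here is a minimal sketch using scikit-learn; the labels and values are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and model answers for a Yes/No task
# (error detection, schema matching, entity matching).
gold = ["Yes", "No", "No", "Yes", "No"]
pred = ["Yes", "No", "Yes", "Yes", "No"]
print("F1:", f1_score(gold, pred, pos_label="Yes"))

# For data imputation, accuracy is exact match between the imputed value
# and the ground truth (values here are made up).
gold_values = ["Canon", "Nikon", "Sony"]
pred_values = ["Canon", "Nikon", "Fuji"]
print("Accuracy:", accuracy_score(gold_values, pred_values))
```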
Performance on unseen tasks
Column Type Annotation
Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|
SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |
Few-shot prompting is disabled for the Jellyfish models.
Attribute Value Extraction
Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|
AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |
Few-shot prompting is disabled for the Jellyfish models.
Prompt Template
{system message}
[INST]:
{prompt} (without the {})
[\INST]]
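For convenience, a small helper that assembles a prompt following the template above could look like the sketch below; build_prompt is a hypothetical name and not part of the released code, and the system message used is the one given in the Prompts section below.

```python
def build_prompt(system_message: str, task_prompt: str) -> str:
    # Follows the template above: system message, an [INST]: block, then [\INST]].
    return f"{system_message}\n\n[INST]:\n\n{task_prompt}\n\n[\INST]]"

system_message = (
    "You are an AI assistant that follows instruction extremely well. "
    "User will give you a question. Your task is to answer as faithfully as you can."
)
print(build_prompt(system_message, "Hello, world."))
```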
Training Details
Training Method
We use LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.
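A minimal sketch of such a configuration with the PEFT library is shown below; only the target modules come from the description above, while the rank, alpha, and dropout values are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model as stated in the model details.
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Target modules follow the text above; r, lora_alpha, and lora_dropout are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```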
Prompts
We provide the prompts used for both fine-tuning and inference; you can structure your data according to these prompts.
System Message
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
Error Detection
The error detection task takes two forms. In the first, the whole record is provided and the task is to determine whether the value of a specific attribute is erroneous. In the second, only the value of a specific attribute is given, and its correctness must be judged from the attribute's name and value alone. The two prompt examples below correspond to these two forms respectively.
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
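To illustrate how a record might be serialized into the first prompt form, here is a small sketch; the record, its attribute names, and the helper functions are made up for illustration.

```python
def format_record(record: dict) -> str:
    # Serialize a record as [attribute 1: value 1, attribute 2: value 2, ...]
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def error_detection_prompt(record: dict, attribute: str) -> str:
    attributes = ", ".join(record.keys())
    return (
        "Your task is to determine if there is an error in the value of a specific attribute "
        "within the whole record provided.\n"
        f"The attributes may include {attributes}.\n"
        "Errors may include, but are not limited to, spelling errors, inconsistencies, "
        "or values that don't make sense given the context of the whole record.\n"
        f"Record {format_record(record)}\n"
        f"Attribute for Verification: [{attribute}: {record[attribute]}]\n"
        f"Question: Is there an error in the value of {attribute}? "
        "Choose your answer from: [Yes, No]."
    )

# Made-up hospital record for illustration.
record = {"city": "birmingham", "state": "alabamax", "zip code": "35235"}
print(error_detection_prompt(record, "state"))
```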
Data Imputation
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
Schema Matching
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
Entity Matching
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
Column Type Annotation
We follow the prompt in Column Type Annotation using ChatGPT (text+inst+2-step).
Attribute Value Extraction
We follow the prompt in Product Attribute Value Extraction using Large Language Models (textual, without examples).
🔧 Technical Details
The mistralai/Mistral-7B-Instruct-v0.2 model was fine-tuned with LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules to speed up training.
📄 License
This model is released under the Non-Commercial Creative Commons license (CC BY-NC-4.0).



