🚀 Jellyfish-7B
Jellyfish-7B is a large language model with 7 billion parameters. It was fine-tuned from a base model on a dedicated dataset, performs well on data preprocessing tasks such as error detection and data imputation, and can effectively support data processing work.
🚀 Quick Start
To speed up inference, we strongly recommend running the Jellyfish models with vLLM. Below are two simple Python code examples for running inference with the Jellyfish model:
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Model will be automatically downloaded from HuggingFace model hub if not cached.
# Model files will be cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\INST]]"

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )
output = generation_output[0]

response = tokenizer.decode(
    output[:, input_ids.shape[-1]:][0], skip_special_tokens=True
).strip()
print(response)
```
Advanced Usage
```python
from vllm import LLM, SamplingParams

# To use vllm for inference, you need to download the model files either using HuggingFace model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = "/workspace/models/Jellyfish"

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: The stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\INST]]"

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```
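vLLM also accepts a list of prompts in a single generate call, which is where most of its throughput advantage comes from when processing many records. Below is a minimal sketch of batched inference, assuming the same model path and prompt template as above; the user messages are placeholders.

```python
from vllm import LLM, SamplingParams

# Same assumed local path as in the example above.
model = LLM(model="/workspace/models/Jellyfish")

sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# Placeholder task inputs; in practice these would be prompts built from your data.
user_messages = ["Hello, world.", "How are you today?"]
prompts = [f"{system_message}\n\n[INST]:\n\n{m}\n\n[\INST]]" for m in user_messages]

# A single generate call processes all prompts as one batch.
outputs = model.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text.strip())
```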
✨ Key Features
- High performance: delivers strong results on a range of data preprocessing tasks (error detection, data imputation, schema matching, entity matching, etc.), outperforming models such as GPT-3.5 on several of them.
- Strong interpretability: in a head-to-head comparison with GPT-3.5-turbo, Jellyfish-7B achieves a winning rate of 56.36% (evaluated by GPT-4).
- Multi-task support: covers a variety of data preprocessing tasks, including error detection, data imputation, schema matching, entity matching, column type annotation, and attribute value extraction.
📦 Installation
No dedicated installation steps are documented; refer to the model-loading code in the examples above. With the transformers library, the model is downloaded automatically from the HuggingFace model hub if it is not already cached; with vllm, adjust the model path to match your local environment.
📚 Documentation
Model Details
Jellyfish-7B is a large language model with 7 billion parameters. We fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on a subset of the Jellyfish-Instruct dataset.
More details about the model can be found in the Jellyfish paper.
Property | Details |
---|---|
Developed by | Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada |
Contact | dongyuyang@nec.com |
Funded by | NEC Corporation, Osaka University |
Language(s) | English |
License | Non-Commercial Creative Commons license (CC BY-NC-4.0) |
Finetuned from model | mistralai/Mistral-7B-Instruct-v0.2 |
Citation
If you find our work useful, please cite it as follows:
```bibtex
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```
Performance on seen tasks
Task | Type | Dataset | Best Non-LLM Method¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|---|---|
Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
Average | | | 80.44 | -- | 84.17 | 72.58 | -- | 82.74 | 81.55 | 86.02 |
For GPT-3.5 and GPT-4, we used few-shot prompting on all datasets. For the Jellyfish models, few-shot prompting is disabled on seen datasets and enabled on unseen datasets.
Accuracy is used as the metric for data imputation; F1 score is used for the other tasks.
1. Best non-LLM methods:
   - HoloDetect for Error Detection seen datasets
   - RAHA for Error Detection unseen datasets
   - IPM for Data Imputation
   - SMAT for Schema Matching
   - Ditto for Entity Matching
2. Large Language Models as Data Preprocessors
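As a reference for the metrics above (accuracy for data imputation, F1 score elsewhere), here is a minimal sketch using scikit-learn; the labels and values are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and model answers for a Yes/No task
# (error detection, schema matching, entity matching).
gold = ["Yes", "No", "No", "Yes", "No"]
pred = ["Yes", "No", "Yes", "Yes", "No"]
print("F1:", f1_score(gold, pred, pos_label="Yes"))

# For data imputation, accuracy is exact match between the imputed value
# and the ground truth (values here are made up).
gold_values = ["Canon", "Nikon", "Sony"]
pred_values = ["Canon", "Nikon", "Fuji"]
print("Accuracy:", accuracy_score(gold_values, pred_values))
```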
Performance on unseen tasks
Column Type Annotation
Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|
SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83 | 76.33 | 82 |
Few-shot prompting is disabled for the Jellyfish models.
Attribute Value Extraction
Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
---|---|---|---|---|---|---|---|---|
AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |
Few-shot prompting is disabled for the Jellyfish models.
Prompt Template
{system message}
[INST]:
{prompt} (without the {})
[\INST]]
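For convenience, a small helper that assembles a prompt following the template above could look like the sketch below; build_prompt is a hypothetical name and not part of the released code, and the system message used is the one given in the Prompts section below.

```python
def build_prompt(system_message: str, task_prompt: str) -> str:
    # Follows the template above: system message, an [INST]: block, then [\INST]].
    return f"{system_message}\n\n[INST]:\n\n{task_prompt}\n\n[\INST]]"

system_message = (
    "You are an AI assistant that follows instruction extremely well. "
    "User will give you a question. Your task is to answer as faithfully as you can."
)
print(build_prompt(system_message, "Hello, world."))
```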
Training Details
Training Method
We use LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.
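A minimal sketch of such a configuration with the PEFT library is shown below; only the target modules come from the description above, while the rank, alpha, and dropout values are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model as stated in the model details.
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Target modules follow the text above; r, lora_alpha, and lora_dropout are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```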
Prompts
We provide the prompts used for both fine-tuning and inference; you can structure your data according to these prompts.
System Message
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
Error Detection
The error detection task takes two forms. In the first, the whole record is provided and the task is to determine whether the value of a specific attribute is erroneous. In the second, only the value of a specific attribute is given, and its correctness must be judged from the attribute's name and value alone. The two prompt examples below correspond to these two forms respectively.
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
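To illustrate how a record might be serialized into the first prompt form, here is a small sketch; the record, its attribute names, and the helper functions are made up for illustration.

```python
def format_record(record: dict) -> str:
    # Serialize a record as [attribute 1: value 1, attribute 2: value 2, ...]
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def error_detection_prompt(record: dict, attribute: str) -> str:
    attributes = ", ".join(record.keys())
    return (
        "Your task is to determine if there is an error in the value of a specific attribute "
        "within the whole record provided.\n"
        f"The attributes may include {attributes}.\n"
        "Errors may include, but are not limited to, spelling errors, inconsistencies, "
        "or values that don't make sense given the context of the whole record.\n"
        f"Record {format_record(record)}\n"
        f"Attribute for Verification: [{attribute}: {record[attribute]}]\n"
        f"Question: Is there an error in the value of {attribute}? "
        "Choose your answer from: [Yes, No]."
    )

# Made-up hospital record for illustration.
record = {"city": "birmingham", "state": "alabamax", "zip code": "35235"}
print(error_detection_prompt(record, "state"))
```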
Data Imputation
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
Schema Matching
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
Entity Matching
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
Column Type Annotation
We follow the prompt in Column Type Annotation using ChatGPT (text+inst+2-step).
Attribute Value Extraction
We follow the prompt in Product Attribute Value Extraction using Large Language Models (textual, without examples).
🔧 Technical Details
The mistralai/Mistral-7B-Instruct-v0.2 model was fine-tuned with LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules to speed up training.
📄 License
This model is released under the Non-Commercial Creative Commons license (CC BY-NC-4.0).



