🚀 Jellyfish-7B
Jellyfish-7B is a large language model with 7 billion parameters, obtained by fine-tuning a base model on a dedicated dataset. It performs strongly on data preprocessing tasks and is an effective aid for data processing work.
🚀 Quick Start
To speed up inference, we strongly recommend running the Jellyfish models with vLLM. Below are two simple Python code examples for running inference with the Jellyfish model:
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The model will be automatically downloaded from the Hugging Face model hub if not cached.
# Model files are cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

# The prompt follows the template: {system message}, [INST]:, {prompt}, [\INST]]
prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )
output = generation_output.sequences
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[:, input_ids.shape[-1]:][0], skip_special_tokens=True
).strip()
print(response)
```
Advanced Usage
```python
from vllm import LLM, SamplingParams

# To use vLLM for inference, you need to download the model files either from the
# Hugging Face model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = "/workspace/models/Jellyfish"

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: the stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```
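Because data preprocessing workloads typically cover many records, note that vLLM's `generate` also accepts a list of prompts and batches them in a single call. A minimal sketch, assuming the `model`, `sampling_params`, and `system_message` objects defined above (the user messages are illustrative placeholders):

```python
# Batched inference: vLLM schedules a list of prompts together.
# Assumes `model`, `sampling_params`, and `system_message` from the example above.
user_messages = ["Hello, world.", "How are you?"]  # illustrative placeholders
prompts = [
    f"{system_message}\n\n[INST]:\n\n{m}\n\n[\\INST]]" for m in user_messages
]
outputs = model.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text.strip())
```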
✨ Key Features
- High performance: delivers strong results on multiple data preprocessing tasks (e.g., error detection, data imputation, schema matching, and entity matching), outperforming models such as GPT-3.5 on several of them.
- Strong interpretability: Jellyfish-7B achieves a 56.36% win rate against GPT-3.5-turbo (as judged by GPT-4).
- Multi-task support: covers a range of data preprocessing tasks, including error detection, data imputation, schema matching, entity matching, column type annotation, and attribute value extraction.
📦 Installation
No dedicated installation steps are given; see the model-loading code in the examples above. With the `transformers` library (e.g., installed via `pip install transformers`), the model is downloaded automatically from the Hugging Face model hub when not cached; with `vllm` (e.g., `pip install vllm`), you need to download the model files and set the model path for your local environment.
📚 Documentation
Model Details
Jellyfish-7B is a large language model with 7 billion parameters. We fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on a subset of the Jellyfish-Instruct dataset.
More details about the model can be found in the Jellyfish paper.
| Property | Details |
|---|---|
| Developed by | Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada |
| Contact | dongyuyang@nec.com |
| Funded by | NEC Corporation, Osaka University |
| Language | English |
| License | Non-commercial Creative Commons license (CC BY-NC-4.0) |
| Fine-tuned from | mistralai/Mistral-7B-Instruct-v0.2 |
Citation
If you find our work useful, please cite it as:
```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```
Performance on Seen Tasks
| Task | Type | Dataset | Best non-LLM method¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|---|---|---|
| Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
| Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
| Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
| Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
| Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
| Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
| Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
| Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
| Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
| Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
| Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
| Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
| Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
| Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
| Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
| Average | | | 80.44 | - | 84.17 | 72.58 | - | 82.74 | 81.55 | 86.02 |
For GPT-3.5 and GPT-4, the few-shot approach was used on all datasets. For the Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets.
Data imputation is evaluated with accuracy; all other tasks are evaluated with the F1 score (see the sketch after the footnotes below).
1. The best non-LLM methods: HoloDetect for error detection on seen datasets; RAHA for error detection on unseen datasets; IPM for data imputation; SMAT for schema matching; Ditto for entity matching.
2. Results from Large Language Models as Data Preprocessors.
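For the Yes/No matching-style tasks, the F1 score can be computed directly over the decoded answers. A minimal sketch using scikit-learn; the gold labels, predictions, and label mapping here are illustrative assumptions, not the authors' evaluation code:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and model answers for a Yes/No task such as entity matching.
gold = ["Yes", "No", "Yes", "Yes"]
pred = ["Yes", "No", "No", "Yes"]

def to_binary(answers):
    # Treat "Yes" as the positive class.
    return [1 if a == "Yes" else 0 for a in answers]

print(f1_score(to_binary(gold), to_binary(pred)))  # F1 of the positive class
```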
Performance on Unseen Tasks
Column Type Annotation
| Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83.00 | 76.33 | 82.00 |
The few-shot approach is disabled for the Jellyfish models.
Attribute Value Extraction
| Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|---|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |
The few-shot approach is disabled for the Jellyfish models.
Prompt Template
```
{system message}

[INST]:

{prompt} (without the {})

[\INST]]
```
Training Details
Training Method
We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.
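For illustration, a minimal sketch of what such a LoRA setup could look like with the peft library; the rank, alpha, and dropout values are assumptions for demonstration, since only the target modules are stated here:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
)
lora_config = LoraConfig(
    r=16,               # assumed rank, not stated in this document
    lora_alpha=32,      # assumed scaling factor
    lora_dropout=0.05,  # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # as stated above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```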
Prompts
We provide the prompts used for both fine-tuning and inference; you can structure your own data according to them.
System Message
```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```
Error Detection
The error detection task comes in two forms. In the first, the complete record row is provided, and the task is to determine whether the value of a specific attribute is erroneous. In the second, only the value of a specific attribute is given, and its correctness must be judged from the attribute's name and value alone. The two prompt examples below correspond to these two forms respectively.
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```

```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
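To make the record-based templates concrete, here is a minimal sketch of serializing a record into the first error detection form; the serialize_record helper and the sample record are our own illustrations, not part of the released prompts. The same bracketed serialization applies to the data imputation and entity matching prompts below.

```python
# Illustrative helper: turn a dict record into the bracketed attribute list
# used by the prompts above. The helper name and the sample data are ours.
def serialize_record(record: dict) -> str:
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

record = {"name": "caffe boa vida", "city": "san francisco"}  # sample data
attribute = "city"
user_message = (
    "Your task is to determine if there is an error in the value of a specific "
    "attribute within the whole record provided.\n"
    f"Record {serialize_record(record)}\n"
    f"Attribute for Verification: [{attribute}: {record[attribute]}]\n"
    f"Question: Is there an error in the value of {attribute}? "
    "Choose your answer from: [Yes, No]."
)
print(user_message)
```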
Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```
Column Type Annotation
We follow the prompt (text + inst + 2-step) from Column Type Annotation using ChatGPT.
Attribute Value Extraction
We follow the prompt (textual, without examples) from Product Attribute Value Extraction using Large Language Models.
🔧 Technical Details
The mistralai/Mistral-7B-Instruct-v0.2 model is fine-tuned with LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules, to speed up the training process.
📄 License
This model is released under a non-commercial Creative Commons license (CC BY-NC-4.0).