🚀 Jellyfish-7B
Jellyfish-7B is a large language model with 7 billion parameters, obtained by fine-tuning a base model on a dedicated dataset. It performs strongly on data preprocessing tasks and is an effective aid for data processing work.
🚀 Quick Start
To speed up inference, we strongly recommend running the Jellyfish models with vLLM. Below are two simple Python code examples for running inference with the Jellyfish model:
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# The model will be automatically downloaded from the Hugging Face model hub if not cached.
# Model files are cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

# The prompt follows the template: {system message}, [INST]:, {prompt}, [\INST]]
prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )
output = generation_output.sequences
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    output[:, input_ids.shape[-1]:][0], skip_special_tokens=True
).strip()
print(response)
```
Advanced Usage
```python
from vllm import LLM, SamplingParams

# To use vLLM for inference, you need to download the model files either from the
# Hugging Face model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = "/workspace/models/Jellyfish"

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: the stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["[INST]"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n[INST]:\n\n{user_message}\n\n[\\INST]]"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)
```
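Because data preprocessing workloads typically cover many records, note that vLLM's `generate` also accepts a list of prompts and batches them in a single call. A minimal sketch, assuming the `model`, `sampling_params`, and `system_message` objects defined above (the user messages are illustrative placeholders):

```python
# Batched inference: vLLM schedules a list of prompts together.
# Assumes `model`, `sampling_params`, and `system_message` from the example above.
user_messages = ["Hello, world.", "How are you?"]  # illustrative placeholders
prompts = [
    f"{system_message}\n\n[INST]:\n\n{m}\n\n[\\INST]]" for m in user_messages
]
outputs = model.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text.strip())
```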
✨ Key Features
- High performance: delivers strong results on multiple data preprocessing tasks (e.g., error detection, data imputation, schema matching, and entity matching), outperforming models such as GPT-3.5 on several of them.
- Strong interpretability: Jellyfish-7B achieves a 56.36% win rate against GPT-3.5-turbo (as judged by GPT-4).
- Multi-task support: covers a range of data preprocessing tasks, including error detection, data imputation, schema matching, entity matching, column type annotation, and attribute value extraction.
📦 Installation
No dedicated installation steps are given; see the model-loading code in the examples above. With the `transformers` library (e.g., installed via `pip install transformers`), the model is downloaded automatically from the Hugging Face model hub when not cached; with `vllm` (e.g., `pip install vllm`), you need to download the model files and set the model path for your local environment.
📚 Documentation
Model Details
Jellyfish-7B is a large language model with 7 billion parameters. We fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on a subset of the Jellyfish-Instruct dataset.
More details about the model can be found in the Jellyfish paper.
| Property | Details |
|---|---|
| Developed by | Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada |
| Contact | dongyuyang@nec.com |
| Funded by | NEC Corporation, Osaka University |
| Language | English |
| License | Non-commercial Creative Commons license (CC BY-NC-4.0) |
| Fine-tuned from | mistralai/Mistral-7B-Instruct-v0.2 |
Citation
If you find our work useful, please cite it as:
```
@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}
```
Performance on Seen Tasks
| Task | Type | Dataset | Best non-LLM method¹ | GPT-3.5² | GPT-4² | GPT-4o | Table-GPT | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|---|---|---|
| Error Detection | Seen | Adult | 99.10 | 99.10 | 92.01 | 83.58 | -- | 77.40 | 73.74 | 99.33 |
| Error Detection | Seen | Hospital | 94.40 | 97.80 | 90.74 | 44.76 | -- | 94.51 | 93.40 | 95.59 |
| Error Detection | Unseen | Flights | 81.00 | -- | 83.48 | 66.01 | -- | 69.15 | 66.21 | 82.52 |
| Error Detection | Unseen | Rayyan | 79.00 | -- | 81.95 | 68.53 | -- | 75.07 | 81.06 | 90.65 |
| Data Imputation | Seen | Buy | 96.50 | 98.50 | 100 | 100 | -- | 98.46 | 98.46 | 100 |
| Data Imputation | Seen | Restaurant | 77.20 | 88.40 | 97.67 | 90.70 | -- | 89.53 | 87.21 | 89.53 |
| Data Imputation | Unseen | Flipkart | 68.00 | -- | 89.94 | 83.20 | -- | 87.14 | 87.48 | 81.68 |
| Data Imputation | Unseen | Phone | 86.70 | -- | 90.79 | 86.78 | -- | 86.52 | 85.68 | 87.21 |
| Schema Matching | Seen | MIMIC-III | 20.00 | -- | 40.00 | 29.41 | -- | 53.33 | 45.45 | 40.00 |
| Schema Matching | Seen | Synthea | 38.50 | 45.20 | 66.67 | 6.56 | -- | 55.56 | 47.06 | 56.00 |
| Schema Matching | Unseen | CMS | 50.00 | -- | 19.35 | 22.22 | -- | 42.86 | 38.10 | 59.29 |
| Entity Matching | Seen | Amazon-Google | 75.58 | 63.50 | 74.21 | 70.91 | 70.10 | 81.69 | 81.42 | 81.34 |
| Entity Matching | Seen | Beer | 94.37 | 100 | 100 | 90.32 | 96.30 | 100.00 | 100.00 | 96.77 |
| Entity Matching | Seen | DBLP-ACM | 98.99 | 96.60 | 97.44 | 95.87 | 93.80 | 98.65 | 98.77 | 98.98 |
| Entity Matching | Seen | DBLP-GoogleScholar | 95.70 | 83.80 | 91.87 | 90.45 | 92.40 | 94.88 | 95.03 | 98.51 |
| Entity Matching | Seen | Fodors-Zagats | 100 | 100 | 100 | 93.62 | 100 | 100 | 100 | 100 |
| Entity Matching | Seen | iTunes-Amazon | 97.06 | 98.20 | 100 | 98.18 | 94.30 | 96.30 | 96.30 | 98.11 |
| Entity Matching | Unseen | Abt-Buy | 89.33 | -- | 92.77 | 78.73 | -- | 86.06 | 88.84 | 89.58 |
| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | 90.27 | 79.19 | 82.40 | 84.91 | 85.24 | 89.42 |
| Average | | | 80.44 | - | 84.17 | 72.58 | - | 82.74 | 81.55 | 86.02 |
For GPT-3.5 and GPT-4, the few-shot approach was used on all datasets. For the Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets.
Data imputation is evaluated with accuracy; all other tasks are evaluated with the F1 score (see the sketch after the footnotes below).
1. The best non-LLM methods: HoloDetect for error detection on seen datasets; RAHA for error detection on unseen datasets; IPM for data imputation; SMAT for schema matching; Ditto for entity matching.
2. Results from Large Language Models as Data Preprocessors.
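For the Yes/No matching-style tasks, the F1 score can be computed directly over the decoded answers. A minimal sketch using scikit-learn; the gold labels, predictions, and label mapping here are illustrative assumptions, not the authors' evaluation code:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and model answers for a Yes/No task such as entity matching.
gold = ["Yes", "No", "Yes", "Yes"]
pred = ["Yes", "No", "No", "Yes"]

def to_binary(answers):
    # Treat "Yes" as the positive class.
    return [1 if a == "Yes" else 0 for a in answers]

print(f1_score(to_binary(gold), to_binary(pred)))  # F1 of the positive class
```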
Performance on Unseen Tasks
Column Type Annotation
| Dataset | RoBERTa (159 shots)¹ | GPT-3.5¹ | GPT-4 | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|
| SOTAB | 79.20 | 89.47 | 91.55 | 65.05 | 83.00 | 76.33 | 82.00 |
The few-shot approach is disabled for the Jellyfish models.
Attribute Value Extraction
| Dataset | Stable Beluga 2 70B¹ | SOLAR 70B¹ | GPT-3.5¹ | GPT-4¹ | GPT-4o | Jellyfish-7B | Jellyfish-8B | Jellyfish-13B |
|---|---|---|---|---|---|---|---|---|
| AE-110k | 52.10 | 49.20 | 61.30 | 55.50 | 55.77 | 56.09 | 59.55 | 58.12 |
| OA-Mine | 50.80 | 55.20 | 62.70 | 68.90 | 60.20 | 51.98 | 59.22 | 55.96 |
The few-shot approach is disabled for the Jellyfish models.
Prompt Template
```
{system message}

[INST]:

{prompt} (without the {})

[\INST]]
```
Training Details
Training Method
We used LoRA to speed up the training process, targeting the q_proj, k_proj, v_proj, and o_proj modules.
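For illustration, a minimal sketch of what such a LoRA setup could look like with the peft library; the rank, alpha, and dropout values are assumptions for demonstration, since only the target modules are stated here:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
)
lora_config = LoraConfig(
    r=16,               # assumed rank, not stated in this document
    lora_alpha=32,      # assumed scaling factor
    lora_dropout=0.05,  # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # as stated above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```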
Prompts
We provide the prompts used for both fine-tuning and inference; you can structure your own data according to them.
System Message
```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```
Error Detection
The error detection task comes in two forms. In the first, the complete record row is provided, and the task is to determine whether the value of a specific attribute is erroneous. In the second, only the value of a specific attribute is given, and its correctness must be judged from the attribute's name and value alone. The two prompt examples below correspond to these two forms respectively.
```
Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```

```
Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
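To make the record-based templates concrete, here is a minimal sketch of serializing a record into the first error detection form; the serialize_record helper and the sample record are our own illustrations, not part of the released prompts. The same bracketed serialization applies to the data imputation and entity matching prompts below.

```python
# Illustrative helper: turn a dict record into the bracketed attribute list
# used by the prompts above. The helper name and the sample data are ours.
def serialize_record(record: dict) -> str:
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

record = {"name": "caffe boa vida", "city": "san francisco"}  # sample data
attribute = "city"
user_message = (
    "Your task is to determine if there is an error in the value of a specific "
    "attribute within the whole record provided.\n"
    f"Record {serialize_record(record)}\n"
    f"Attribute for Verification: [{attribute}: {record[attribute]}]\n"
    f"Question: Is there an error in the value of {attribute}? "
    "Choose your answer from: [Yes, No]."
)
print(user_message)
```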
Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
Schema Matching
```
Your task is to determine if the two attributes (columns) are semantically equivalent in the context of merging two tables.
Each attribute will be provided by its name and a brief description.
Your goal is to assess if they refer to the same information based on these names and descriptions provided.
Attribute A is [name: {value of name}, description: {value of description}].
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```
Column Type Annotation
We follow the prompt (text + inst + 2-step) from Column Type Annotation using ChatGPT.
Attribute Value Extraction
We follow the prompt (textual, without examples) from Product Attribute Value Extraction using Large Language Models.
🔧 Technical Details
The mistralai/Mistral-7B-Instruct-v0.2 model is fine-tuned with LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules, to speed up the training process.
📄 License
This model is released under a non-commercial Creative Commons license (CC BY-NC-4.0).