Jellyfish-13B Open-Source Large Language Model - Free to Deploy, Supports Data Preprocessing (Error Detection and More)

Jellyfish 13B

Developed by NECOUDBFM

Jellyfish-13B is a 13-billion-parameter large language model specialized for data preprocessing tasks such as error detection, data imputation, schema matching, and entity matching.

Large Language Model

Transformers

English #Data preprocessing specialist #Multi-task data processing #Efficient local execution

Downloads: 102

Release date: 10/16/2023

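Model Overview

Fine-tuned from Open-Orca/OpenOrca-Platypus2-13B with a focus on data preprocessing tasks. Its performance is comparable to GPT-3.5 and GPT-4, and it can be run locally at low cost without compromising data security.

Model Features

Data preprocessing specialist

Optimized specifically for data cleaning and preprocessing, with strong results across a range of data tasks.

Efficient local execution

At 13B parameters, the model can be deployed locally, balancing performance against resource consumption.

Dual-version design

Available in a standard version and an Interpreter version, suited to system integration and end-user use respectively.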

Model Capabilities

Error detection

Data imputation

Schema matching

Entity matching

Column type annotation

Attribute value extraction

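Use Cases

Data quality management

Error detection in datasets

Identifies erroneous or anomalous values in a dataset.

Achieves an F1 score of 95.59% on the Hospital dataset

Missing value imputation

Automatically fills in missing values in a dataset.

Achieves 100% accuracy on the Buy dataset

Data integration

Entity matching

Identifies records in different data sources that refer to the same entity.

Achieves an F1 score of 98.51% on the DBLP-GoogleScholar dataset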

🚀 Jellyfish-13B

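Jellyfish-13B is a large language model specialized for data preprocessing tasks. It performs well on error detection, data imputation, schema matching, and entity matching. Lighter versions, Jellyfish-7B and Jellyfish-8B, are also available.

😄 We strongly recommend the 7B and 8B models, since they generalize and reason better!

✨ Key Features

A large language model specialized for data preprocessing tasks
Strong performance on error detection, data imputation, schema matching, and entity matching
Lighter versions, Jellyfish-7B and Jellyfish-8B, are also available
Runs locally, so data security is not compromised
Retains strong performance on NLP tasks

📦 Installation

We strongly recommend running Jellyfish with vLLM to speed up inference.

💻 Usage Examples

Basic usage

Using the Transformers and Torch modules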

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Model will be automatically downloaded from HuggingFace model hub if not cached.
# Model files will be cached in "~/.cache/huggingface/hub/models--NECOUDBFM--Jellyfish/" by default.
# You can also download the model manually and replace the model name with the path to the model files.
model = AutoModelForCausalLM.from_pretrained(
    "NECOUDBFM/Jellyfish",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NECOUDBFM/Jellyfish")

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(device)

# You can modify the sampling parameters according to your needs.
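# Note: the keyword is `do_sample` (singular); it enables sampling so that
# temperature and top_p take effect.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.35,
    top_p=0.9,
)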

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=1024,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.15,
    )

output = generation_output[0]
response = tokenizer.decode(
    output[:, input_ids.shape[-1] :][0], skip_special_tokens=True
).strip()

print(response)

Using vLLM

from vllm import LLM, SamplingParams

# To use vllm for inference, you need to download the model files either using HuggingFace model hub or manually.
# You should modify the path to the model according to your local environment.
path_to_model = (
    "/workspace/models/Jellyfish"
)

model = LLM(model=path_to_model)

# You can modify the sampling parameters according to your needs.
# Caution: The stop parameter should not be changed.
sampling_params = SamplingParams(
    temperature=0.35,
    top_p=0.9,
    max_tokens=1024,
    stop=["### Instruction:"],
)

system_message = "You are an AI assistant that follows instruction extremely well. Help as much as you can."

# You need to define the user_message variable based on the task and the data you want to test on.
user_message = "Hello, world."

prompt = f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text.strip()
print(response)

Advanced usage

For error detection

Your task is to determine if there is an error in the value of a specific attribute within the whole record provided.
The attributes may include {attribute 1}, {attribute 2}, ...
Errors may include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense given the context of the whole record.
Record [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].

Your task is to determine if there is an error in the value of a specific attribute.
The attributes may belong to a {keyword} record and could be one of the following: {attribute 1}, {attribute 2}, ...
Errors can include, but are not limited to, spelling errors, inconsistencies, or values that don't make sense for that attribute.
Note: Missing values (N/A or "nan") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
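
The record-level template above can be filled in programmatically. The following is a minimal sketch, not the authors' tooling: the function names `build_error_detection_prompt` and `parse_yes_no` and the sample record are hypothetical, and rendering the attribute list as a comma-separated sentence is an assumption about how the `{attribute 1}, {attribute 2}, ...` placeholder is meant to be expanded.

```python
def build_error_detection_prompt(record, target_attribute):
    """Fill the record-level error detection template for one attribute.

    `record` is a dict mapping attribute names to values (hypothetical input format).
    """
    attributes = ", ".join(record)
    record_text = ", ".join(f"{name}: {value}" for name, value in record.items())
    return (
        "Your task is to determine if there is an error in the value of a "
        "specific attribute within the whole record provided.\n"
        f"The attributes may include {attributes}.\n"
        "Errors may include, but are not limited to, spelling errors, "
        "inconsistencies, or values that don't make sense given the context "
        "of the whole record.\n"
        f"Record [{record_text}]\n"
        f"Attribute for Verification: [{target_attribute}: {record[target_attribute]}]\n"
        f"Question: Is there an error in the value of {target_attribute}? "
        "Choose your answer from: [Yes, No]."
    )


def parse_yes_no(response):
    """Map a model response to True (Yes), False (No), or None (unrecognized)."""
    words = response.strip().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'").lower()
    return {"yes": True, "no": False}.get(first)


# Hypothetical record, for illustration only.
record = {"name": "St. Mary Hospital", "city": "Bostonn", "state": "MA"}
user_message = build_error_detection_prompt(record, "city")
print(user_message)
```

The resulting string is what would be assigned to `user_message` in the basic usage examples; `parse_yes_no` then turns the model's short answer into a value a pipeline can branch on.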

For data imputation

You are presented with a {ke

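📚 Documentation

Model details

Jellyfish-13B is a large language model with 13 billion parameters. It is tailored to data preprocessing tasks, including error detection, data imputation, schema matching, and entity matching.

We fine-tuned the Open-Orca/OpenOrca-Platypus2-13B model on datasets relevant to data preprocessing tasks. Its performance is competitive with previous state-of-the-art algorithms and with LLMs such as OpenAI's GPT-3.5 and GPT-4 (as demonstrated in earlier work).

Because it is a 13B model, Jellyfish can be run locally in a cost-effective way without compromising data security. Beyond its proficiency in data preprocessing tasks, Jellyfish also retains strong performance on NLP tasks as an LLM, as shown by the comparison of NLP benchmark scores between Jellyfish and OpenOrca-Platypus2.

Jellyfish is released in two versions: Jellyfish-13B (the main branch) and Jellyfish-13B-Interpreter (the alternative branch). Jellyfish-13B is designed to give precise, concise answers. Jellyfish-13B-Interpreter, by contrast, is fine-tuned on data that includes reasoning and step-by-step thought processes for handling data preprocessing tasks, distilling knowledge from GPT-4.

The two versions target different application scenarios. Jellyfish-13B suits integration into larger data management systems, because its simple, clear responses can be easily turned into code in a data management/analysis pipeline. Jellyfish-13B-Interpreter is more user-oriented: its responses provide in-depth data insights without requiring advanced coding skills or an intricate grasp of statistics.

For more details about the model, see the Jellyfish paper.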

Performance

Performance on seen tasks

Task	Type	Dataset	Non-LLM SoTA¹	GPT-3.5²	GPT-4²	GPT-4o	Table-GPT	Jellyfish-7B	Jellyfish-8B	Jellyfish-13B
Error Detection	Seen	Adult	99.10	99.10	92.01	83.58	--	77.40	73.74	99.33
Error Detection	Seen	Hospital	94.40	97.80	90.74	44.76	--	94.51	93.40	95.59
Error Detection	Unseen	Flights	81.00	--	83.48	66.01	--	69.15	66.21	82.52
Error Detection	Unseen	Rayyan	79.00	--	81.95	68.53	--	75.07	81.06	90.65
Data Imputation	Seen	Buy	96.50	98.50	100	100	--	98.46	98.46	100
Data Imputation	Seen	Restaurant	77.20	88.40	97.67	90.70	--	89.53	87.21	89.53
Data Imputation	Unseen	Flipkart	68.00	--	89.94	83.20	--	87.14	87.48	81.68
Data Imputation	Unseen	Phone	86.70	--	90.79	86.78	--	86.52	85.68	87.21
Schema Matching	Seen	MIMIC-III	20.00	--	40.00	29.41	--	53.33	45.45	40.00
Schema Matching	Seen	Synthea	38.50	45.20	66.67	6.56	--	55.56	47.06	56.00
Schema Matching	Unseen	CMS	50.00	--	19.35	22.22	--	42.86	38.10	59.29
Entity Matching	Seen	Amazon-Google	75.58	63.50	74.21	70.91	70.10	81.69	81.42	81.34
Entity Matching	Seen	Beer	94.37	100	100	90.32	96.30	100.00	100.00	96.77
Entity Matching	Seen	DBLP-ACM	98.99	96.60	97.44	95.87	93.80	98.65	98.77	98.98
Entity Matching	Seen	DBLP-GoogleScholar	95.70	83.80	91.87	90.45	92.40	94.88	95.03	98.51
Entity Matching	Seen	Fodors-Zagats	100	100	100	93.62	100	100	100	100
Entity Matching	Seen	iTunes-Amazon	97.06	98.20	100	98.18	94.30	96.30	96.30	98.11
Entity Matching	Unseen	Abt-Buy	89.33	--	92.77	78.73	--	86.06	88.84	89.58
Entity Matching	Unseen	Walmart-Amazon	86.89	87.00	90.27	79.19	82.40	84.91	85.24	89.42
Average			80.44	-	84.17	72.58	-	82.74	81.55	86.02

For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets.
Accuracy is the metric for data imputation; the F1 score is used for the other tasks.

¹ HoloDetect for error detection on seen datasets, RAHA for error detection on unseen datasets, IPM for data imputation, SMAT for schema matching, and Ditto for entity matching.
² Results from Large Language Models as Data Preprocessors.
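
For readers who want to score their own runs with the same two metrics, here is a minimal sketch of accuracy (used for data imputation) and binary F1 (used for the other tasks). The function names and the toy predictions are illustrative assumptions, and treating "Yes" as the positive class is an assumption about how the Yes/No answers are scored, not something the source states. The table reports these values as percentages, so multiply by 100 to compare.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def f1_score(predictions, labels, positive="Yes"):
    """Binary F1 with `positive` as the positive class (assumed to be "Yes")."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, for illustration only.
predictions = ["Yes", "No", "Yes", "No"]
labels = ["Yes", "Yes", "No", "No"]
print(accuracy(predictions, labels))  # 0.5
print(f1_score(predictions, labels))  # 0.5
```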

Performance on unseen tasks

Column type annotation

Dataset	RoBERTa (159 shots)¹	GPT-3.5¹	GPT-4	GPT-4o	Jellyfish-7B	Jellyfish-8B	Jellyfish-13B
SOTAB	79.20	89.47	91.55	65.05	83	76.33	82

Few-shot is disabled for Jellyfish models.

¹ Results from Column Type Annotation using ChatGPT.

Attribute value extraction

Dataset	Stable Beluga 2 70B¹	SOLAR 70B¹	GPT-3.5¹	GPT-4¹	GPT-4o	Jellyfish-7B	Jellyfish-8B	Jellyfish-13B
AE-110k	52.10	49.20	61.30	55.50	55.77	56.09	59.55	58.12
OA-Mine	50.80	55.20	62.70	68.90	60.20	51.98	59.22	55.96

Few-shot is disabled for Jellyfish models.

¹ Results from Product Attribute Value Extraction using Large Language Models.

Prompt template

### Instruction:

<prompt> (without the <>)

### Response:
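
The template can be wrapped in a small helper so that every task prompt is formatted the same way. This is a sketch under assumptions: the helper names `build_prompt` and `extract_response` are hypothetical, the system message and spacing are copied from the usage examples earlier on this page, and cutting the output at the next `### Instruction:` marker simply mirrors the stop string used in the vLLM example.

```python
DEFAULT_SYSTEM_MESSAGE = (
    "You are an AI assistant that follows instruction extremely well. "
    "Help as much as you can."
)


def build_prompt(user_message, system_message=DEFAULT_SYSTEM_MESSAGE):
    """Wrap a task prompt in the Instruction/Response template."""
    return f"{system_message}\n\n### Instruction:\n\n{user_message}\n\n### Response:\n\n"


def extract_response(generated_text):
    """Keep only the text before any follow-on '### Instruction:' marker."""
    return generated_text.split("### Instruction:")[0].strip()


prompt = build_prompt("Hello, world.")
print(prompt)
```

When generating with Transformers, which has no stop string in the example above, `extract_response` is one way to trim a continuation that runs past the first answer.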

Training Details

Training data

We fine-tuned Jellyfish on the training and validation sets from the paper Can Foundation Models Wrangle Your Data?. The original datasets come from HazyResearch/fm_data_tasks, RAHA, SMAT, and IPM. From these we built an instruction-tuning dataset for LLM fine-tuning that mimics the style of the OpenOrca dataset.

Training method

To speed up training, we used LoRA, targeting the q_proj, k_proj, v_proj, and o_proj modules.
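
As a rough illustration of why restricting LoRA to the four attention projections speeds up training, the sketch below counts the trainable adapter parameters against the frozen weights they sit on. All three dimensions are assumptions for illustration: the hidden size (5120) and layer count (40) reflect a Llama-2-13B-style architecture, and the rank (16) is a placeholder, since the source does not state the rank that was used.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds two low-rank matrices per module: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed dimensions (not stated in the source).
HIDDEN_SIZE = 5120   # assumed Llama-2-13B-style hidden size
NUM_LAYERS = 40      # assumed number of transformer layers
RANK = 16            # hypothetical LoRA rank

# Modules named in the training description.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

num_adapted = NUM_LAYERS * len(TARGET_MODULES)
adapter_params = num_adapted * lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
frozen_params = num_adapted * HIDDEN_SIZE * HIDDEN_SIZE
ratio = adapter_params / frozen_params

print(f"trainable adapter parameters: {adapter_params:,}")
print(f"frozen parameters in the targeted projections: {frozen_params:,}")
print(f"trainable fraction of the targeted weights: {ratio:.4%}")
```

Under these assumptions the adapters hold well under 1% as many parameters as the projection matrices they modify, which is where the training speed-up and memory savings come from.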

🔧 Technical Details

Model type: Large language model
Parameters: 13 billion
Fine-tuned from: Open-Orca/OpenOrca-Platypus2-13B
Training dataset: Training and validation sets from Can Foundation Models Wrangle Your Data?
Training method: LoRA

📄 License

Non-Commercial Creative Commons license (CC BY-NC-4.0)

Citation

If you find our work useful, please give us credit by citing:

@article{zhang2023jellyfish,
  title={Jellyfish: A Large Language Model for Data Preprocessing},
  author={Zhang, Haochen and Dong, Yuyang and Xiao, Chuan and Oyamada, Masafumi},
  journal={arXiv preprint arXiv:2312.01678},
  year={2023}
}