Ablation Model FineWeb Edu: an open-source English text completion model

Ablation Model Fineweb Edu

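Developed by HuggingFaceFW

This model is part of the FineWeb ablation studies. It has 1.82 billion parameters, is based on the Llama architecture, was trained on the FineWeb-Edu dataset, and is suited to English text completion.

Large language model

Transformers

English

Open-source license: Apache-2.0

#ablation-model #english-text-completion #llama-architecture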

Downloads: 262

Release date: May 29, 2024

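Model overview

This is an ablation model built to study the effect of the FineWeb dataset. It is intended mainly for English text generation and completion, and it has not been instruction-tuned.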

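Model features

Ablation model

Designed specifically to study how different FineWeb dataset configurations affect model performance

Context window

Supports a context length of 2048 tokens

Transparent training process

Intermediate checkpoints are provided every 1000 training steps, which makes it easier to study training dynamics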

Model capabilities

English text generation

Text completion

Language model research

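Use cases

Research

Dataset ablation studies

Used to compare how different data preprocessing methods affect model performance

Text generation

English text completion

Generates a coherent continuation from a given prefix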

🚀 HuggingFaceFW/ablation-model-fineweb-edu

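This model was trained on English web data and can be used for text generation and completion. It also serves as a baseline for comparing against other models trained under the same conditions.

🚀 Quick start

This section describes the basic usage of the model.

Model summary

This model is part of the FineWeb ablations, which are described in detail in the technical report listed in the table below. The model has 1.82 billion parameters, a context length of 2048 tokens, and uses the Llama architecture with RoPE. It was trained on 350 billion tokens of FineWeb-Edu, tokenized with the gpt2 tokenizer.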

Attribute	Details
Model type	Llama model
Training data	350B tokens of FineWeb-Edu
Paper	🍷 FineWeb: decanting the web for the finest text data at scale https://hf.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
License	Apache-2.0
Language	English

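Intended use

Because this model was trained on English web data and is not instruction-tuned, it is intended for English text completion. Importantly, its primary purpose is comparison with other models trained under the same conditions. Note that it does not necessarily represent the best result achievable with the given dataset.

Generation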

# pip install -q transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # use "cuda" for GPU or "cpu" for CPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Encode the prompt, generate a continuation, and decode it back to text
inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
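
Since the context length is 2048 tokens, the prompt and the generated continuation have to fit in that window together. The sketch below shows one way to budget the number of new tokens. `max_new_tokens_for` is a hypothetical helper, not part of transformers; its result would be passed as the `max_new_tokens` argument of `model.generate`. Without an explicit length argument, `generate` falls back to the default in the generation config, which is typically short.

```python
CONTEXT_LENGTH = 2048  # context length stated for this model


def max_new_tokens_for(prompt_len, requested, context_len=CONTEXT_LENGTH):
    """Clamp the requested number of new tokens so that
    prompt + continuation fits inside the context window.
    Hypothetical helper for illustration only."""
    if prompt_len >= context_len:
        raise ValueError("prompt already fills the context window")
    return max(0, min(requested, context_len - prompt_len))


# A short prompt leaves the request untouched; a long one gets clamped.
print(max_new_tokens_for(prompt_len=4, requested=50))
print(max_new_tokens_for(prompt_len=2000, requested=100))
```

In practice `prompt_len` would be `inputs.shape[1]` from the generation snippet above.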

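✨ Key features

Intermediate checkpoints (coming soon)

We are releasing intermediate checkpoints for this model every 1000 training steps, each in a separate branch. The naming convention is step-001000-2BT.

With transformers, you can load a specific model revision by passing the revision argument.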

model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")

You can list all revisions of the model with the following code.

from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])
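
The branch names appear to encode the training step and the approximate number of tokens seen so far. The sketch below parses names of that shape and sorts them by step, which can be handy when iterating over checkpoints in training order. The regular expression is an assumption inferred from the single documented name `step-001000-2BT`, and `step-002000-4BT` is a hypothetical example, not a branch confirmed to exist.

```python
import re

# Assumed pattern, inferred from the documented name "step-001000-2BT".
BRANCH_RE = re.compile(r"^step-(\d+)-(\d+)BT$")


def parse_checkpoint_branch(name):
    """Return (step, billions_of_tokens), or None if the name does not match."""
    match = BRANCH_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sort_checkpoint_branches(names):
    """Keep only checkpoint-style branch names, ordered by training step."""
    parsed = [(parse_checkpoint_branch(n), n) for n in names]
    return [n for p, n in sorted((p, n) for p, n in parsed if p is not None)]


# Hypothetical branch list, standing in for [b.name for b in out.branches].
sample = ["main", "step-002000-4BT", "step-001000-2BT"]
print(sort_checkpoint_branches(sample))
```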

🔧 Technical details

Training

Model

Architecture: Llama model
Pretraining steps: 167k
Pretraining tokens: 350B
Precision: bfloat16
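
As a rough consistency check, the figures above can be combined with the parameter count and context length from the model summary. This is a back-of-the-envelope sketch: the inputs are the rounded values quoted in this card, and it assumes every training step consumes the same number of tokens, which the card does not state.

```python
# Rounded figures quoted in this card.
PARAMETERS = 1.82e9
PRETRAINING_STEPS = 167_000
PRETRAINING_TOKENS = 350e9
CONTEXT_LENGTH = 2048
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits

# Assumption: a constant number of tokens per training step.
tokens_per_step = PRETRAINING_TOKENS / PRETRAINING_STEPS
sequences_per_step = tokens_per_step / CONTEXT_LENGTH


def tokens_seen_at(step):
    """Approximate number of tokens consumed after `step` training steps."""
    return step * tokens_per_step


# Size of the raw weights alone, excluding activations and optimizer state.
weights_gb = PARAMETERS * BYTES_PER_BF16 / 1e9

print(f"tokens per step:     {tokens_per_step:,.0f}")
print(f"sequences per step:  {sequences_per_step:,.0f}")
print(f"tokens at step 1000: {tokens_seen_at(1000) / 1e9:.1f}B")
print(f"bfloat16 weights:    {weights_gb:.2f} GB")
```

Under these assumptions, step 1000 corresponds to roughly 2.1 billion tokens, which is consistent with the `2BT` suffix in the checkpoint branch name.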

Hardware

GPUs: 64 H100
Training duration: 72 hours
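
The hardware figures translate into a total compute budget and an average throughput. The sketch below assumes the 72 hours are wall-clock time for the whole run on all 64 GPUs; the card does not say this explicitly, so the results are estimates only.

```python
# Rounded figures quoted in this card.
NUM_GPUS = 64
TRAINING_HOURS = 72  # assumed to be wall-clock time for the full run
PRETRAINING_TOKENS = 350e9

gpu_hours = NUM_GPUS * TRAINING_HOURS
tokens_per_gpu_second = PRETRAINING_TOKENS / (gpu_hours * 3600)

print(f"total GPU-hours:       {gpu_hours:,}")
print(f"tokens per GPU-second: {tokens_per_gpu_second:,.0f}")
```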

Software

nanotron (training)
datatrove (tokenization)
lighteval (evaluation)

Evaluation

All ablation models were evaluated with lighteval. To reproduce the results, follow the steps below.

# download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
    --custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
    --tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"
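
Every entry in the `--tasks` argument follows the same `custom|<task>|0|1` pattern, so the string can be assembled from task lists instead of being edited by hand. The sketch below is one way to do that. `build_tasks_arg` is a hypothetical helper, not part of lighteval, and only three MMLU subjects are listed for brevity; the command above uses the full set.

```python
# Non-MMLU tasks, copied from the command above.
BASE_TASKS = [
    "hellaswag", "winogrande", "piqa", "siqa", "openbookqa",
    "arc:easy", "arc:challenge", "commonsense_qa",
]

# Shortened for illustration; the full command lists every MMLU subject.
MMLU_SUBJECTS = ["abstract_algebra", "anatomy", "astronomy"]


def build_tasks_arg(base_tasks, mmlu_subjects, suite="custom", suffix="0|1"):
    """Join task names into the comma-separated string passed to --tasks.
    The suite and suffix defaults mirror the entries in the command above."""
    names = list(base_tasks) + [f"mmlu:{subject}" for subject in mmlu_subjects]
    return ",".join(f"{suite}|{name}|{suffix}" for name in names)


print(build_tasks_arg(BASE_TASKS, MMLU_SUBJECTS))
```

Generating the string this way also avoids stray line breaks inside a task name when the command is copied between documents.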

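Note in particular that the MMLU prompts differ slightly from those used by lm-evaluation-harness and the Open LLM Leaderboard; details are given in a separate blog post. We use a prompt template that provides a better signal for small, non-instruction-tuned models.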