ablation-model-fineweb-edu: an open-source English text completion model

Ablation Model Fineweb Edu

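Developed by HuggingFaceFW

This model is part of the FineWeb ablation experiments. It has 1.82 billion parameters, uses the Llama architecture, was trained on the FineWeb-Edu dataset, and is suited to English text completion.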

Large language model

Transformers

English | License: Apache-2.0 | #AblationModel #EnglishTextCompletion #LlamaArchitecture

Downloads: 262

Released: 5/29/2024

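Model overview

This is an ablation model for studying the effect of the FineWeb dataset. It is intended mainly for English text generation and completion, and it has not been instruction-tuned.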

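Model features

Ablation model

Designed specifically to study how different configurations of the FineWeb dataset affect model performance

Context window

Supports a context length of 2048 tokens

Transparent training process

Provides an intermediate checkpoint every 1000 training steps, which makes it easier to study training dynamics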

Model capabilities

English text generation

Text completion

Language model research

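Use cases

Research

Dataset ablation studies

Comparing how different data preprocessing methods affect model performance

Text generation

English text completion

Generating a coherent continuation from a given prefix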

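🚀 HuggingFaceFW/ablation-model-fineweb-edu model card

This is a Transformer-based language model intended mainly for English text completion. It was trained on a specific English dataset so that its performance can be compared with other models trained under the same conditions.

🚀 Quick start

Install the dependency and run a completion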

# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # use "cpu" to run without a GPU

# Load the tokenizer and the model weights, then move the model to the device
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Encode the prompt, generate a continuation, and decode it back to text
inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
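
The quick-start call to `model.generate(inputs)` relies on the library's default generation settings, which typically cap the output at a short length, and the decoded text includes the prompt. The sketch below shows one way to request a longer greedy completion and return only the newly generated text. The helper names `continuation_only` and `complete`, and the budget of 50 new tokens, are illustrative assumptions and not part of the original model card.

```python
MODEL_ID = "HuggingFaceFW/ablation-model-fineweb-edu"


def continuation_only(output_ids, prompt_length):
    """Drop the prompt tokens that generate() echoes at the start of its output."""
    return output_ids[prompt_length:]


def complete(prompt, max_new_tokens=50, device="cpu"):
    """Greedy completion of `prompt` (hypothetical helper; downloads the model on first use)."""
    # Imported lazily so the pure helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

    # Calling the tokenizer returns input_ids together with an attention mask.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,        # assumed budget, not a recommended value
        do_sample=False,                      # greedy decoding for repeatable output
        pad_token_id=tokenizer.eos_token_id,  # avoids a warning if no pad token is defined
    )[0]
    new_ids = continuation_only(output_ids, inputs["input_ids"].shape[1])
    return tokenizer.decode(new_ids, skip_special_tokens=True)
```

Calling `complete("Machine Learning is")` would return only the continuation. Since the model is not instruction-tuned, prompts work best when phrased as the beginning of a text to be continued.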

Load a specific model revision

model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")

List all model revisions

from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])

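✨ Key features

Model parameters: 1.82 billion parameters with a context length of 2048 tokens.
Architecture: Llama architecture with RoPE.
Training data: trained on 350 billion tokens of the FineWeb-Edu dataset, tokenized with the gpt2 tokenizer.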
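
Because the model was trained with a context length of 2048 tokens, the prompt and the generated continuation together should fit within that budget. Below is a minimal sketch of one way to enforce this on a list of token IDs. The function name `fit_prompt` and the policy of keeping the most recent tokens are assumptions chosen for illustration.

```python
CONTEXT_LENGTH = 2048  # context length stated in the feature list


def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Trim a prompt so that prompt + generated tokens fit in the context window.

    Keeps the most recent tokens, on the assumption that they matter most
    for continuing the text.
    """
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens
    return token_ids[-budget:]
```

For example, a 3000-token prompt with `max_new_tokens=48` is cut down to its last 2000 tokens, while a prompt that is already short enough is returned unchanged.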

📦 Installation

Before running the code examples, make sure the transformers library is installed:

pip install -q transformers

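📚 Detailed documentation

Intended use

This model was trained on English web data and has not been instruction-tuned, so it is intended mainly for English text completion. Note that its primary intended use is comparison with other models trained under the same conditions. It is not necessarily the best result achievable with the given dataset.

Intermediate checkpoints (coming soon)

We will release intermediate checkpoints of this model every 1000 training steps, each in its own branch. Branch names follow the convention step-001000-2BT.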
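
To work with these branches programmatically, a name such as step-001000-2BT can be split into its parts. The sketch below assumes the pattern "step-<zero-padded step>-<N>BT" and reads the trailing "BT" figure as billions of tokens seen so far. Both are inferred from the single example name and are not confirmed by the model card. The function names are hypothetical.

```python
import re

# Pattern inferred from the single example name "step-001000-2BT" (an assumption).
_BRANCH_RE = re.compile(r"^step-(\d+)-(\d+)BT$")


def parse_checkpoint_branch(name):
    """Return (training_step, billions_of_tokens), or None for other branches such as "main"."""
    match = _BRANCH_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sorted_checkpoints(branch_names):
    """Keep only checkpoint branches and order them by training step."""
    checkpoints = [n for n in branch_names if parse_checkpoint_branch(n) is not None]
    return sorted(checkpoints, key=lambda n: parse_checkpoint_branch(n)[0])
```

The list of branch names returned by `list_repo_refs` can be passed to `sorted_checkpoints`, and each resulting name can then be used as the `revision` argument of `from_pretrained`.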

Training

Model

Property	Details
Model type	Llama model
Pretraining steps	167k
Pretraining tokens	350B
Precision	bfloat16
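
The figures in this table imply an approximate batch size, which the card does not state directly. The short calculation below is a sketch under the assumption that the rounded values "167k" and "350B" are close to the true totals and that every sequence uses the full 2048-token context.

```python
PRETRAINING_TOKENS = 350e9   # "350B" from the table
PRETRAINING_STEPS = 167_000  # "167k" from the table
CONTEXT_LENGTH = 2048        # context length stated for the model

# Average number of tokens consumed by one optimizer step.
tokens_per_step = PRETRAINING_TOKENS / PRETRAINING_STEPS

# Equivalent number of full-length sequences per step (assumes no padding).
sequences_per_step = tokens_per_step / CONTEXT_LENGTH

print(f"tokens per step:    {tokens_per_step:,.0f}")
print(f"sequences per step: {sequences_per_step:,.0f}")
```

This works out to roughly 2.1 million tokens, or about 1,023 full-length sequences, per step. At that rate step 1000 corresponds to about 2 billion tokens, which is consistent with the "2BT" suffix in the checkpoint name step-001000-2BT.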

Hardware

Property	Details
GPUs	64 H100
Training time	72 hours
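
For a rough sense of the compute budget, the hardware figures can be combined with the 350B-token total from the model table. This is a back-of-the-envelope sketch: it assumes the 72 hours were spent entirely on training at a constant rate, which the card does not confirm.

```python
NUM_GPUS = 64               # from the hardware table
TRAINING_HOURS = 72         # from the hardware table
PRETRAINING_TOKENS = 350e9  # from the model table

# Total compute expressed in GPU-hours.
gpu_hours = NUM_GPUS * TRAINING_HOURS

# Average throughput over the whole run, in total and per GPU.
tokens_per_second = PRETRAINING_TOKENS / (TRAINING_HOURS * 3600)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

print(f"GPU-hours:            {gpu_hours}")
print(f"tokens/s (all GPUs):  {tokens_per_second:,.0f}")
print(f"tokens/s (per GPU):   {tokens_per_second_per_gpu:,.0f}")
```

Under these assumptions the run used 4,608 GPU-hours, at an average of about 21,000 tokens per second per GPU.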

Software

nanotron for training
datatrove for tokenization
lighteval for evaluation

Evaluation

We evaluate all ablation models with lighteval using the same setup. To reproduce our results, make sure to follow the instructions here.

# download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
    --custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
    --tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"

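Note in particular that the MMLU prompts differ slightly from those used in lm-evaluation-harness and the Open LLM Leaderboard; see this blog post for more information. The prompt template we use gives a better signal for small models that have not been instruction-tuned.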
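
The long `--tasks` argument in the evaluation command follows a regular pattern, `custom|<task>|0|1`, joined by commas. When evaluating only a subset of tasks, the string can be assembled from a list of names, as in the sketch below. The helper name `build_task_spec` is hypothetical, and the trailing `0|1` fields are copied verbatim from the command without interpreting them.

```python
def build_task_spec(task_names, suite="custom", fields="0|1"):
    """Join task names into the comma-separated string passed to --tasks.

    The suite name and the trailing "0|1" fields are copied verbatim from
    the evaluation command in this card.
    """
    return ",".join(f"{suite}|{name}|{fields}" for name in task_names)


# A few names taken from the command; the full command lists many more MMLU subsets.
commonsense = ["hellaswag", "winogrande", "piqa", "siqa", "openbookqa"]
mmlu_subsets = ["abstract_algebra", "anatomy", "astronomy"]

tasks = build_task_spec(commonsense + [f"mmlu:{s}" for s in mmlu_subsets])
print(tasks)
```

The printed string can be pasted as the value of `--tasks`. Results from a subset are only comparable with other ablation models evaluated on the same subset.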