ablation-model-fineweb-edu Open-source English Text Completion Model

Ablation Model Fineweb Edu

Developed by HuggingFaceFW

This model is part of the FineWeb ablation experiments. It has 1.82 billion parameters, is based on the Llama architecture, was trained on the FineWeb-Edu dataset, and is suited to English text completion.

Large Language Model

Transformers

English

Open Source License: Apache-2.0

#Ablation experiment model #English text completion #Llama architecture

Release Time : 5/29/2024

Model Overview

This is an ablation model built to study the effects of the FineWeb dataset. It is intended for English text generation and completion and has not been instruction fine-tuned.

Model Features

Ablation experiment model

Specially designed to study the impact of different configurations of the FineWeb dataset on model performance

Context window

Supports a context length of 2048 tokens

Transparent training process

Provides intermediate checkpoints every 1000 training steps for studying training dynamics

Model Capabilities

English text generation

Text completion

Language model research

Use Cases

Research purposes

Dataset ablation study

Used to compare the effects of different data preprocessing methods on model performance

Text generation

English text completion

Generates coherent subsequent text based on given prefixes

🚀 Model Card for HuggingFaceFW/ablation-model-fineweb-edu

This model card gives an overview of HuggingFaceFW/ablation-model-fineweb-edu, covering its summary, usage, training details, evaluation, and limitations.

✨ Features

Part of the FineWeb ablations.
Uses Llama architecture with RoPE.
Trained on 350B tokens from FineWeb-Edu.
Suitable for English text completion.

📦 Installation

To use this model, install the transformers library. The usage examples also need PyTorch, because they request PyTorch tensors (return_tensors="pt"). Install transformers with the following command:

pip install -q transformers

💻 Usage Examples

Basic Usage

# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceFW/ablation-model-fineweb-edu"
device = "cuda"  # use "cpu" if no GPU is available

# Load the tokenizer and the model weights, then move the model to the device
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Encode the prompt, generate a continuation, and decode it back to text
inputs = tokenizer.encode("Machine Learning is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

Advanced Usage

You can load a specific model revision with transformers using the argument revision:

model = AutoModelForCausalLM.from_pretrained("HuggingFaceFW/ablation-model-fineweb-edu", revision="step-001000-2BT")

You can list all available revisions of the model with the following code:

from huggingface_hub import list_repo_refs
out = list_repo_refs("HuggingFaceFW/ablation-model-fineweb-edu")
print([b.name for b in out.branches])

📚 Documentation

Model summary

This model is part of the FineWeb ablations, detailed in this technical report. It has 1.82B parameters and a context length of 2048 tokens, and uses the Llama architecture with RoPE. It was trained on 350B tokens from FineWeb-Edu, tokenized with the gpt2 tokenizer.

Paper: FineWeb: decanting the web for the finest text data at scale
License: Apache-2.0
Languages: English
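
Because the context length is 2048 tokens, the prompt and the generated continuation have to fit in that window together. A minimal sketch of a budgeting helper follows. `fit_prompt` and its parameters are hypothetical names, not part of transformers, and keeping the most recent tokens is just one possible truncation policy, not necessarily what the authors use.

```python
CONTEXT_LENGTH = 2048  # context length stated in the model summary


def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Return the most recent prompt tokens that leave room for generation."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens
    # Keep the tail of the prompt: for completion, the latest text matters most.
    return list(token_ids[-budget:])
```

With the tokenizer from the basic usage example, the ids could come from `tokenizer.encode(text)` before they are turned into a tensor and passed to `model.generate`.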

Intended use

This model was trained on English web data and is not instruction-tuned, making it intended for text completion in English. It is important to note that the primary intended use case of this model is to compare its performance with other models trained under the same conditions. This model is not necessarily the best possible outcome achievable with the given dataset.
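
Since the model only continues text, a task is usually posed as a prefix whose natural continuation is the answer, rather than as an instruction. The sketch below builds such a few-shot prefix. `build_completion_prompt` and the `Q:`/`A:` layout are illustrative assumptions, not a format the model is known to have been trained on.

```python
def build_completion_prompt(examples, query):
    """Format solved examples plus an open query as one completion prefix."""
    blocks = []
    for question, answer in examples:
        blocks.append(f"Q: {question}\nA: {answer}\n")
    # End on an open "A:" so the most likely continuation is the answer.
    blocks.append(f"Q: {query}\nA:")
    return "\n".join(blocks)


prompt = build_completion_prompt(
    [("What is the capital of France?", "Paris")],
    "What is the capital of Italy?",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of "Machine Learning is" in the basic usage example.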

Intermediate checkpoints (soon)

We are releasing intermediate checkpoints for this model every 1000 training steps, each in its own branch. The naming convention is step-001000-2BT.
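
Branch names that follow this convention can be parsed and sorted to walk through training in order. The sketch below assumes the pattern `step-<zero-padded step>-<N>BT`. Reading the `BT` suffix as billions of tokens is an inference from the single example name, not something stated here, and `parse_checkpoint` and `sorted_checkpoints` are hypothetical helpers.

```python
import re

# Assumed pattern, inferred from the single example "step-001000-2BT".
_BRANCH_RE = re.compile(r"^step-(\d+)-(\d+)BT$")


def parse_checkpoint(branch):
    """Return (step, billions_of_tokens), or None for non-checkpoint branches."""
    match = _BRANCH_RE.match(branch)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sorted_checkpoints(branches):
    """Keep checkpoint branches only, ordered by training step."""
    checkpoints = [b for b in branches if parse_checkpoint(b) is not None]
    return sorted(checkpoints, key=parse_checkpoint)
```

The input list could be the branch names returned by `list_repo_refs` in the Advanced Usage snippet; branches such as `main` are filtered out.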

Training

Model

Property	Details
Architecture	Llama model
Pretraining steps	167k
Pretraining tokens	350B
Precision	bfloat16
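
The figures in this table imply a rough batch size. The arithmetic below is a back-of-the-envelope estimate derived from the rounded values above (167k steps, 350B tokens) and the 2048-token context length. The actual batch size is not stated here.

```python
pretraining_tokens = 350e9   # "350B" from the table (rounded)
pretraining_steps = 167e3    # "167k" from the table (rounded)
context_length = 2048        # from the model summary

tokens_per_step = pretraining_tokens / pretraining_steps
sequences_per_step = tokens_per_step / context_length
tokens_after_1000_steps = 1000 * tokens_per_step

print(f"{tokens_per_step / 1e6:.2f}M tokens per step")
print(f"{sequences_per_step:.0f} sequences of {context_length} tokens per step")
print(f"{tokens_after_1000_steps / 1e9:.2f}B tokens after 1000 steps")
```

This works out to roughly 2.1M tokens per step, or about 1023 full-length sequences, and about 2.1B tokens per 1000 steps. That is consistent with the checkpoint name step-001000-2BT if the suffix stands for billions of tokens.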

Hardware

Property	Details
GPUs	64 H100
Training time	72 wall clock hours
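
Combining the hardware figures with the token count gives an estimate of compute cost and throughput. This is derived arithmetic on the rounded values in the tables, not a measured number.

```python
num_gpus = 64                # H100 GPUs, from the hardware table
wall_clock_hours = 72        # training time, from the hardware table
pretraining_tokens = 350e9   # "350B" from the model table (rounded)

gpu_hours = num_gpus * wall_clock_hours
tokens_per_second = pretraining_tokens / (wall_clock_hours * 3600)
tokens_per_second_per_gpu = tokens_per_second / num_gpus

print(f"{gpu_hours} GPU hours")
print(f"{tokens_per_second / 1e6:.2f}M tokens/s overall")
print(f"{tokens_per_second_per_gpu / 1e3:.1f}k tokens/s per GPU")
```

The estimate comes to 4608 GPU hours and roughly 21k tokens per second per GPU, assuming all 72 hours were spent training.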

Software

nanotron for training
datatrove for tokenization
lighteval for evaluation

Evaluation

We evaluated all our ablation models with lighteval using the same setup. To reproduce our numbers, make sure to follow the instructions here.

# download https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py and run:
accelerate launch --num_processes=1 lighteval/run_evals_accelerate.py --model_args="pretrained=HuggingFaceFW/ablation-model-fineweb-edu" \
    --custom_tasks "lighteval_tasks.py" --output_dir [OUTPUTPATH] --max_samples 1000 \
    --tasks "custom|hellaswag|0|1,custom|winogrande|0|1,custom|piqa|0|1,custom|siqa|0|1,custom|openbookqa|0|1,custom|arc:easy|0|1,custom|arc:challenge|0|1,custom|commonsense_qa|0|1,custom|mmlu:abstract_algebra|0|1,custom|mmlu:anatomy|0|1,custom|mmlu:astronomy|0|1,custom|mmlu:business_ethics|0|1,custom|mmlu:clinical_knowledge|0|1,custom|mmlu:college_biology|0|1,custom|mmlu:college_chemistry|0|1,custom|mmlu:college_computer_science|0|1,custom|mmlu:college_mathematics|0|1,custom|mmlu:college_medicine|0|1,custom|mmlu:college_physics|0|1,custom|mmlu:computer_security|0|1,custom|mmlu:conceptual_physics|0|1,custom|mmlu:econometrics|0|1,custom|mmlu:electrical_engineering|0|1,custom|mmlu:elementary_mathematics|0|1,custom|mmlu:formal_logic|0|1,custom|mmlu:global_facts|0|1,custom|mmlu:high_school_biology|0|1,custom|mmlu:high_school_chemistry|0|1,custom|mmlu:high_school_computer_science|0|1,custom|mmlu:high_school_european_history|0|1,custom|mmlu:high_school_geography|0|1,custom|mmlu:high_school_government_and_politics|0|1,custom|mmlu:high_school_macroeconomics|0|1,custom|mmlu:high_school_mathematics|0|1,custom|mmlu:high_school_microeconomics|0|1,custom|mmlu:high_school_physics|0|1,custom|mmlu:high_school_psychology|0|1,custom|mmlu:high_school_statistics|0|1,custom|mmlu:high_school_us_history|0|1,custom|mmlu:high_school_world_history|0|1,custom|mmlu:human_aging|0|1,custom|mmlu:human_sexuality|0|1,custom|mmlu:international_law|0|1,custom|mmlu:jurisprudence|0|1,custom|mmlu:logical_fallacies|0|1,custom|mmlu:machine_learning|0|1,custom|mmlu:management|0|1,custom|mmlu:marketing|0|1,custom|mmlu:medical_genetics|0|1,custom|mmlu:miscellaneous|0|1,custom|mmlu:moral_disputes|0|1,custom|mmlu:moral_scenarios|0|1,custom|mmlu:nutrition|0|1,custom|mmlu:philosophy|0|1,custom|mmlu:prehistory|0|1,custom|mmlu:professional_accounting|0|1,custom|mmlu:professional_law|0|1,custom|mmlu:professional_medicine|0|1,custom|mmlu:professional_psychology|0|1,custom|mmlu:public_relations|0|1,custom|mmlu:security_studies|0|1,custom|mmlu:sociology|0|1,custom|mmlu:us_foreign_policy|0|1,custom|mmlu:virology|0|1,custom|mmlu:world_religions|0|1"

In particular, the MMLU prompts differ slightly from those in lm-evaluation-harness and the Open LLM Leaderboard; see this blog post for details. We use prompt templates that provide better signal for small, non-instruction-tuned models.
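
The `--tasks` argument in the evaluation command follows a regular pattern: each entry reads `custom|<task>|0|1`, and entries are joined by commas. A small sketch for assembling it from task names is shown here, using only a subset of the names for brevity. `build_tasks_arg` is a hypothetical helper, and treating the prefix and the two trailing fields as fixed strings is an assumption based on the command itself.

```python
def build_tasks_arg(task_names, prefix="custom", suffix="0|1"):
    """Join task names into the comma-separated format used by the command."""
    return ",".join(f"{prefix}|{name}|{suffix}" for name in task_names)


core_tasks = ["hellaswag", "winogrande", "piqa"]
mmlu_subsets = ["abstract_algebra", "anatomy"]

tasks_arg = build_tasks_arg(core_tasks + [f"mmlu:{s}" for s in mmlu_subsets])
print(tasks_arg)
```

Keeping the task names in a list makes it easier to run a subset of the benchmarks or to spot a name that was broken by copy and paste.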

Limitations

This model was predominantly trained on English data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content.

📄 License

This model is released under the Apache-2.0 license.
