Llama-3-Instruct-8B-SPPO-Iter3
This model, Llama-3-Instruct-8B-SPPO-Iter3, is a text-generation model fine-tuned with Self-Play Preference Optimization (SPPO) to improve the alignment and performance of the base language model.
Quick Start
This README provides detailed information about the Llama-3-Instruct-8B-SPPO-Iter3 model, including its development, evaluation results, and training hyperparameters.
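To try the model with the Hugging Face transformers library, a minimal sketch is shown below. The repository id UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3 is an assumption and not stated in this README; substitute the actual Hub id of this model if it differs.

```python
# Minimal sketch: load the model and run one chat turn with transformers.
# The repository id below is an assumption, not taken from this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize what SPPO fine-tuning does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```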
Features
- Optimized Architecture: Developed based on the meta-llama/Meta-Llama-3-8B-Instruct architecture, with further optimization through Self-Play Preference Optimization.
- Diverse Datasets: Utilizes prompt sets from the openbmb/UltraFeedback dataset, which is split for multiple iterations.
- Synthetic Responses: All responses used during development are synthetic.
Documentation
Model Development
This model was developed using Self-Play Preference Optimization (SPPO) at iteration 3. Starting from the meta-llama/Meta-Llama-3-8B-Instruct architecture, we used prompt sets from the openbmb/UltraFeedback dataset, split into three parts for the three iterations following snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.
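As an illustration of the prompt split (not the exact training pipeline), the UltraFeedback prompts could be divided into three shards with the datasets library; the column name `instruction` is an assumption here.

```python
# Illustrative sketch: divide the UltraFeedback prompt set into three parts,
# one per SPPO iteration. The actual splits follow
# snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset; the column name is assumed.
from datasets import load_dataset

ultrafeedback = load_dataset("openbmb/UltraFeedback", split="train")

iteration_prompts = [
    ultrafeedback.shard(num_shards=3, index=i)["instruction"]  # assumed column name
    for i in range(3)
]

for i, prompts in enumerate(iteration_prompts, start=1):
    print(f"iteration {i}: {len(prompts)} prompts")
```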
Links to Other Models
Model Description
| Property | Details |
|---|---|
| Model Type | An 8B parameter GPT-like model fine-tuned on synthetic datasets. |
| Language(s) (NLP) | Primarily English |
| License | Apache-2.0 |
| Finetuned from model | meta-llama/Meta-Llama-3-8B-Instruct |
Evaluation Results
Results are reported using lm-evaluation-harness v0.4.1.
Detailed results can be found here
| Metric | Value |
|---|---|
| Avg. | 23.68 |
| IFEval (0-Shot) | 68.28 |
| BBH (3-Shot) | 29.74 |
| MATH Lvl 5 (4-Shot) | 7.33 |
| GPQA (0-Shot) | 2.01 |
| MuSR (0-Shot) | 3.09 |
| MMLU-PRO (5-Shot) | 29.38 |
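The snippet below is a sketch of how such scores can be produced with the lm-evaluation-harness Python API. The task name, few-shot setting, and repository id are assumptions and depend on the installed harness version; it is not the exact configuration behind the numbers above.

```python
# Sketch: score the model on one benchmark with lm-evaluation-harness (v0.4.x API).
# Task name, few-shot setting, and repository id are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3,dtype=bfloat16",
    tasks=["ifeval"],   # illustrative; pick the task variants matching the table above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```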
Training hyperparameters
The following hyperparameters were used during training; a configuration sketch follows the list:
- learning_rate: 5e-07
- eta: 1000
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 1
- seed: 42
- distributed_type: deepspeed_zero3
- num_devices: 8
- optimizer: RMSProp
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_train_epochs: 6.0 (training stopped at epoch 1.0)
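Below is a sketch of how these values might map onto transformers TrainingArguments. The output directory, DeepSpeed config path, and bf16 flag are placeholders/assumptions, and `eta` is an SPPO-specific loss parameter handled by the SPPO trainer rather than a TrainingArguments field.

```python
# Sketch only: the listed hyperparameters expressed as transformers TrainingArguments.
# Output dir, DeepSpeed config path, and bf16 are assumptions; `eta` (1000) is an
# SPPO-specific loss parameter and has no TrainingArguments equivalent.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3-instruct-8b-sppo-iter3",   # placeholder
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    seed=42,
    optim="rmsprop",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=6.0,                # training was stopped at epoch 1.0
    deepspeed="ds_zero3_config.json",    # placeholder path to a ZeRO-3 config
    bf16=True,                           # assumption
)
```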
License
This model is licensed under the Apache-2.0 license.
Technical Details
This model is built on the Self-Play Preference Optimization (SPPO) method, described in the paper Self-Play Preference Optimization for Language Model Alignment. The method improves the alignment of language models by iteratively optimizing preferences over the model's own self-play responses.
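As a rough illustration (paraphrasing the paper's objective rather than reproducing the authors' implementation), each SPPO iteration regresses the log-probability ratio between the policy being trained and the previous-iteration policy toward `eta` times the centered win probability of a sampled response. A per-example squared loss of that shape might look like this:

```python
# Illustrative SPPO-style per-example loss, paraphrased from the cited paper;
# see the paper for the exact objective and the authors' code for the real trainer.
import torch

def sppo_example_loss(
    policy_logprob: torch.Tensor,     # log pi_theta(y | x) under the model being trained
    reference_logprob: torch.Tensor,  # log pi_t(y | x) under the previous-iteration model
    win_prob: torch.Tensor,           # estimated P(y is preferred over pi_t's responses | x)
    eta: float = 1000.0,              # matches the `eta` hyperparameter above
) -> torch.Tensor:
    log_ratio = policy_logprob - reference_logprob
    target = eta * (win_prob - 0.5)   # centered preference probability, scaled by eta
    return (log_ratio - target) ** 2
```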
Citation
@misc{wu2024self,
title={Self-Play Preference Optimization for Language Model Alignment},
author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
year={2024},
eprint={2405.00675},
archivePrefix={arXiv},
primaryClass={cs.LG}
}